Message ID: 20221027200019.4106375-1-sdf@google.com (mailing list archive)
Series: xdp: hints via kfuncs
Stanislav Fomichev wrote:
> This is an RFC for the alternative approach suggested by Martin and
> Jakub. I've tried to CC most of the people from the original discussion,
> feel free to add more if you think I've missed somebody.
>
> Summary:
> - add new BPF_F_XDP_HAS_METADATA program flag and abuse
>   attr->prog_ifindex to pass target device ifindex at load time
> - at load time, find appropriate ndo_unroll_kfunc and call
>   it to unroll/inline kfuncs; kfuncs have the default "safe"
>   implementation if unrolling is not supported by a particular
>   device
> - rewrite xskxceiver test to use C bpf program and extend
>   it to export rx_timestamp (plus add rx timestamp to veth driver)
>
> I've intentionally kept it small and hacky to see whether the approach is
> workable or not.

Hi,

I need RX timestamps now as well, so I was going to work on some code
next week too. My plan was to simply put a kptr to the rx descriptor in
the xdp buffer. If I can read the rx descriptor I can read the
timestamp, the rxhash and any other metadata the NIC has completed. All
the drivers I've looked at stash the data here. I'll inline the
pros/cons compared to this below as I see it.

>
> Pros:
> - we avoid BTF complexity; the BPF programs themselves are now responsible
>   for agreeing on the metadata layout with the AF_XDP consumer

Same, no BTF is needed on the kernel side. Userspace and BPF progs get
to sort it out.

> - the metadata is free if not used

Same.

> - the metadata should, in theory, be cheap if used; kfuncs should be
>   unrolled to the same code as if the metadata was pre-populated and
>   passed with a BTF id

Same, it's just a kptr at this point. Also, one more advantage would be
the ability to read the data without copying it.

> - it's not all or nothing; users can use small subset of metadata which
>   is more efficient than the BTF id approach where all metadata has to be
>   exposed for every frame (and selectively consumed by the users)

Same.

>
> Cons:
> - forwarding has to be handled explicitly; the BPF programs have to
>   agree on the metadata layout (IOW, the forwarding program
>   has to be aware of the final AF_XDP consumer metadata layout)

Same, although IMO this is a PRO. You only get the bytes you need and
care about, and can also augment them with extra good stuff so the
calculation only happens once.

> - TX picture is not clear; but it's not clear with BTF ids as well;
>   I think we've agreed that just reusing whatever we have at RX
>   won't fly at TX; seems like TX XDP program might be the answer
>   here? (with a set of another tx kfuncs to "expose" bpf/af_xdp metadata
>   back into the kernel)

Agree, TX is not addressed.

A bit of extra commentary. By exposing the raw kptr to the rx
descriptor we don't need driver writers to do anything, and we can
easily support all the drivers out of the gate with simple one or two
line changes. This pushes the interesting parts into userspace, and
then BPF writers get to do the work without bothering driver folks;
also, if it's not done today it doesn't matter, because user space can
come along and make it work later. So no scattered kernel dependencies,
which I really would like to avoid here. It's actually very painful to
have to support clusters with N kernels and M devices if they have
different features. Doable, but annoying, and much nicer if we just say
6.2 has support for kptr rx descriptor reading and all XDP drivers
support it. So timestamp and rxhash work across the board.

To find the offset of fields (rxhash, timestamp) you can use standard
BTF relocations; we have all this machinery built up already for all
the other structs we read (net_devices, task structs, inodes, ...) so
it's not a big hurdle at all IMO. We can add userspace libs if folks
really care, but it's just a read so I'm not even sure that is helpful.

I think it's nicer than having kfuncs that need to be written
everywhere. My $.02, although I'll poke around with the below some as
well. Feel free to just hang tight until I have some code; at the
moment I have intel and mellanox drivers that I would want to support.

>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
>
> Stanislav Fomichev (5):
>   bpf: Support inlined/unrolled kfuncs for xdp metadata
>   veth: Support rx timestamp metadata for xdp
>   libbpf: Pass prog_ifindex via bpf_object_open_opts
>   selftests/bpf: Convert xskxceiver to use custom program
>   selftests/bpf: Test rx_timestamp metadata in xskxceiver
>
>  drivers/net/veth.c                             |  31 +++++
>  include/linux/bpf.h                            |   1 +
>  include/linux/btf.h                            |   1 +
>  include/linux/btf_ids.h                        |   4 +
>  include/linux/netdevice.h                      |   3 +
>  include/net/xdp.h                              |  22 ++++
>  include/uapi/linux/bpf.h                       |   5 +
>  kernel/bpf/syscall.c                           |  28 ++++-
>  kernel/bpf/verifier.c                          |  60 +++++++++
>  net/core/dev.c                                 |   7 ++
>  net/core/xdp.c                                 |  28 +++++
>  tools/include/uapi/linux/bpf.h                 |   5 +
>  tools/lib/bpf/libbpf.c                         |   1 +
>  tools/lib/bpf/libbpf.h                         |   6 +-
>  tools/testing/selftests/bpf/Makefile           |   1 +
>  .../testing/selftests/bpf/progs/xskxceiver.c   |  43 +++++++
>  tools/testing/selftests/bpf/xskxceiver.c       | 119 +++++++++++++++--
>  tools/testing/selftests/bpf/xskxceiver.h       |   5 +-
>  18 files changed, 348 insertions(+), 22 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/xskxceiver.c
>
> --
> 2.38.1.273.g43a17bfeac-goog
>
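A minimal sketch of the BTF-relocation-based read John describes,
assuming the kernel hands the program a kptr to the raw rx descriptor.
struct mydrv_rx_desc and its fields are hypothetical stand-ins for a
real driver's descriptor; BPF_CORE_READ() and preserve_access_index
are the standard libbpf CO-RE machinery:

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    /* Hypothetical driver rx descriptor. preserve_access_index makes
     * every field access below a CO-RE relocation, resolved against
     * the running kernel's BTF at load time, so the program keeps
     * working if the offsets move between driver versions.
     */
    struct mydrv_rx_desc {
            __u64 timestamp;
            __u32 rss_hash;
    } __attribute__((preserve_access_index));

    static __always_inline __u64 rx_desc_timestamp(void *desc)
    {
            return BPF_CORE_READ((struct mydrv_rx_desc *)desc, timestamp);
    }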
On Fri, 28 Oct 2022 08:58:18 -0700 John Fastabend wrote:
> A bit of extra commentary. By exposing the raw kptr to the rx
> descriptor we don't need driver writers to do anything, and we can
> easily support all the drivers out of the gate with simple one or two
> line changes. [...] Doable, but annoying, and much nicer if we just
> say 6.2 has support for kptr rx descriptor reading and all XDP
> drivers support it. So timestamp and rxhash work across the board.

IMHO that's a bit of wishful thinking. Driver support is just a small
piece; you'll have different HW and FW versions, feature conflicts etc.
In the end the kernel version is just one variable, and there are many
others you'll already have to track.

And it's actually harder to abstract away inter-HW-generation
differences if the user space code has to handle all of it.

> To find the offset of fields (rxhash, timestamp) you can use standard
> BTF relocations; we have all this machinery built up already for all
> the other structs we read (net_devices, task structs, inodes, ...) so
> it's not a big hurdle at all IMO. We can add userspace libs if folks
> really care, but it's just a read so I'm not even sure that is helpful.
>
> I think it's nicer than having kfuncs that need to be written
> everywhere. My $.02, although I'll poke around with the below some as
> well. Feel free to just hang tight until I have some code; at the
> moment I have intel and mellanox drivers that I would want to support.

I'd prefer if we left the door open for new vendors. Punting descriptor
parsing to user space will indeed result in what you just said: major
vendors are supported and that's it.
On Fri, Oct 28, 2022 at 11:05 AM Jakub Kicinski <kuba@kernel.org> wrote:
> On Fri, 28 Oct 2022 08:58:18 -0700 John Fastabend wrote:
> > A bit of extra commentary. By exposing the raw kptr to the rx
> > descriptor we don't need driver writers to do anything. [...]
>
> IMHO that's a bit of wishful thinking. Driver support is just a small
> piece; you'll have different HW and FW versions, feature conflicts etc.
> In the end the kernel version is just one variable, and there are many
> others you'll already have to track.
>
> And it's actually harder to abstract away inter-HW-generation
> differences if the user space code has to handle all of it.

I've had the same concern:

Until we have some userspace library that abstracts all these details,
it's not really convenient to use. IIUC, with a kptr, I'd get a blob
of data and I need to go through the code and see what particular type
it represents for my particular device and how the data I need is
represented there. There are also these "if this is device v1 -> use
the v1 descriptor format; if it's a v2 -> use this other struct; etc"
complexities that we'll be pushing onto the users. With kfuncs, we put
this burden on the driver developers, but I agree that the drawback
here is that we actually have to wait for the implementations to catch
up.

Jakub mentions FW and I haven't even thought about that; so yeah, bpf
programs might have to take a lot of other state into consideration
when parsing the descriptors; all those details do seem like they
belong in the driver code.

Feel free to send it early with just a handful of drivers implemented;
I'm more interested in the bpf/af_xdp/user api story; if we have some
nice sample/test case that shows how the metadata can be used, that
might push us closer to agreement on the best way to proceed.

> > To find the offset of fields (rxhash, timestamp) you can use
> > standard BTF relocations [...]
>
> I'd prefer if we left the door open for new vendors. Punting descriptor
> parsing to user space will indeed result in what you just said: major
> vendors are supported and that's it.
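The "device v1 vs v2" branching Stanislav mentions could, in the kptr
model, be handled at load time with CO-RE feature probing. A hedged
sketch, with both descriptor structs hypothetical (their definitions,
with preserve_access_index, are omitted; see the earlier sketch):

    /* bpf_core_type_exists() is resolved at load time against the
     * kernel's BTF, so only the branch matching the running driver
     * survives dead-code elimination.
     */
    static __always_inline __u64 desc_timestamp(void *desc)
    {
            if (bpf_core_type_exists(struct mydrv_rx_desc_v2))
                    return BPF_CORE_READ((struct mydrv_rx_desc_v2 *)desc,
                                         ts_ns);
            return BPF_CORE_READ((struct mydrv_rx_desc_v1 *)desc,
                                 timestamp);
    }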
Stanislav Fomichev wrote:
> On Fri, Oct 28, 2022 at 11:05 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > IMHO that's a bit of wishful thinking. Driver support is just a small
> > piece; you'll have different HW and FW versions, feature conflicts etc.
> > In the end the kernel version is just one variable, and there are many
> > others you'll already have to track.

Agree.

> > And it's actually harder to abstract away inter-HW-generation
> > differences if the user space code has to handle all of it.

I don't see how it's any harder in practice though?

> I've had the same concern:
>
> Until we have some userspace library that abstracts all these details,
> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> of data and I need to go through the code and see what particular type
> it represents for my particular device and how the data I need is
> represented there. There are also these "if this is device v1 -> use
> the v1 descriptor format; if it's a v2 -> use this other struct; etc"
> complexities that we'll be pushing onto the users. With kfuncs, we put
> this burden on the driver developers, but I agree that the drawback
> here is that we actually have to wait for the implementations to catch
> up.

I agree with everything there: you will get a blob of data and then
will need to know what field you want to read using BTF. But we
already do this for BPF programs all over the place, so it's not a big
lift for us. All other BPF tracing/observability requires the same
logic. I think XDP/tc are perhaps the only places left where users
write BPF programs without thinking about BTF and kernel data
structures.

But with the proposed kptr the complexity lives in userspace and can be
fixed, added, updated without having to bother with kernel updates,
etc. From my point of view of supporting Cilium it's a win, and much
preferred to having to deal with driver owners on all cloud vendors,
distributions, and so on.

If a vendor updates firmware with new fields I get those immediately.

> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> programs might have to take a lot of other state into consideration
> when parsing the descriptors; all those details do seem like they
> belong in the driver code.

I would prefer to avoid being stuck on requiring driver writers to be
involved. With just a kptr I can support the device and any firmware
versions without requiring help.

> Feel free to send it early with just a handful of drivers implemented;
> I'm more interested in the bpf/af_xdp/user api story; if we have some
> nice sample/test case that shows how the metadata can be used, that
> might push us closer to agreement on the best way to proceed.

I'll try to do an intel and mlx implementation to get a cross section.
I have a good collection of nics here so I should be able to show a
couple of firmware versions. It could be fine, I think, to have the raw
kptr access and then also kfuncs for some things perhaps.

> > I'd prefer if we left the door open for new vendors. Punting descriptor
> > parsing to user space will indeed result in what you just said: major
> > vendors are supported and that's it.

I'm not sure why it would make it harder for new vendors? I think the
opposite: it would be easier, because I don't need vendor support at
all. Thinking it over, seems there could be room for both.

Thanks!
On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> > > And it's actually harder to abstract away inter-HW-generation
> > > differences if the user space code has to handle all of it.
>
> I don't see how it's any harder in practice though?

You need to find out what HW/FW/config you're running, right?
And all you have is a pointer to a blob of unknown type.

Take timestamps for example: some NICs support adjusting the PHC
or doing SW corrections (with different versions of hw/fw/server
platforms being capable of both/one/neither).

Sure, you can extract all this info with tracing and careful
inspection via uAPI. But I don't think that's _easier_.
And the vendors can't run the results thru their validation
(for whatever that's worth).

> I agree with everything there: you will get a blob of data and then
> will need to know what field you want to read using BTF. [...]
>
> But with the proposed kptr the complexity lives in userspace and can be
> fixed, added, updated without having to bother with kernel updates,
> etc. [...]
>
> If a vendor updates firmware with new fields I get those immediately.

Conversely, it's a valid concern that those who *do* actually update
their kernel regularly will have more things to worry about.

> I would prefer to avoid being stuck on requiring driver writers to be
> involved. With just a kptr I can support the device and any firmware
> versions without requiring help.

1) where are you getting all those HW / FW specs :S
2) maybe *you* can, but you're not exactly not an ex-driver developer :S

> I'll try to do an intel and mlx implementation to get a cross section.
> I have a good collection of nics here so I should be able to show a
> couple of firmware versions. It could be fine, I think, to have the raw
> kptr access and then also kfuncs for some things perhaps.
>
> > > I'd prefer if we left the door open for new vendors. [...]
>
> I'm not sure why it would make it harder for new vendors? I think the
> opposite,

TBH I'm only replying to the email because of the above part :)
I thought this would be self-evident, but I guess our perspectives
are different.

Perhaps you look at it from the perspective of SW running on someone
else's cloud, and being able to move to another cloud, without having
to worry if feature X is available in xdp or just skb.

I look at it from the perspective of maintaining a cloud, with people
writing random XDP applications. If I swap a NIC from an incumbent to a
(superior) startup, and cloud users are messing with the raw
descriptor, I'd need to go find every XDP program out there and make
sure it understands the new descriptors.

There is a BPF foundation or whatnot now - what about starting a
certification program for cloud providers and making it clear what
features must be supported to be compatible with XDP 1.0, XDP 2.0 etc?

> it would be easier because I don't need vendor support at all.

Can you support the enfabrica NIC on day 1? :) To an extent, it's just
shifting the responsibility from the HW vendor to the middleware vendor.

> Thinking it over seems there could be room for both.

Are you thinking more or less Stan's proposal but with one of the
callbacks being "give me the raw thing"? Probably as a ro dynptr?
Possible, but I don't think we need to hold off Stan's work.
Hi all,

I was closely following this discussion for some time now. Seems we
reached the point where it's getting interesting for me.

On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> [...]
>
> I look at it from the perspective of maintaining a cloud, with people
> writing random XDP applications. If I swap a NIC from an incumbent to a
> (superior) startup, and cloud users are messing with the raw
> descriptor, I'd need to go find every XDP program out there and make
> sure it understands the new descriptors.

Here is another perspective:

As an AF_XDP application developer I don't want to deal with the
underlying hardware in detail. I'd like to request a feature from the
OS (in this case rx/tx timestamping). If the feature is available I
will simply use it; if not, I might have to work around it, maybe by
falling back to SW timestamping.

All parts of my application (BPF program included) should not have to
be optimized/adjusted for all the different HW variants out there. My
application might run on bare metal/cloud/virtual systems; I do not
want to treat these scenarios differently.

I followed the idea of having a library for parsing the driver-specific
meta information. That would mean that this library has to be kept in
sync with the kernel, right? It doesn't help if a newer kernel provides
XDP hints support for more devices/drivers but the library is not
updated. That might be relevant for all the device update strategies
out there.

In addition, and maybe even contrary, we care about zero copy (ZC)
support. Our current use case has to deal with a lot of small packets,
so we hope to benefit from that. If XDP hints support requires a copy
of the metadata, maybe to drive a HW-independent interface, that might
be a bottleneck for us.

> There is a BPF foundation or whatnot now - what about starting a
> certification program for cloud providers and making it clear what
> features must be supported to be compatible with XDP 1.0, XDP 2.0 etc?
>
> > it would be easier because I don't need vendor support at all.
>
> Can you support the enfabrica NIC on day 1? :) To an extent, it's just
> shifting the responsibility from the HW vendor to the middleware vendor.
>
> > Thinking it over seems there could be room for both.
>
> Are you thinking more or less Stan's proposal but with one of the
> callbacks being "give me the raw thing"? Probably as a ro dynptr?
> Possible, but I don't think we need to hold off Stan's work.
"Bezdeka, Florian" <florian.bezdeka@siemens.com> writes: > Hi all, > > I was closely following this discussion for some time now. Seems we > reached the point where it's getting interesting for me. > > On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote: >> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote: >> > > > And it's actually harder to abstract away inter HW generation >> > > > differences if the user space code has to handle all of it. >> > >> > I don't see how its any harder in practice though? >> >> You need to find out what HW/FW/config you're running, right? >> And all you have is a pointer to a blob of unknown type. >> >> Take timestamps for example, some NICs support adjusting the PHC >> or doing SW corrections (with different versions of hw/fw/server >> platforms being capable of both/one/neither). >> >> Sure you can extract all this info with tracing and careful >> inspection via uAPI. But I don't think that's _easier_. >> And the vendors can't run the results thru their validation >> (for whatever that's worth). >> >> > > I've had the same concern: >> > > >> > > Until we have some userspace library that abstracts all these details, >> > > it's not really convenient to use. IIUC, with a kptr, I'd get a blob >> > > of data and I need to go through the code and see what particular type >> > > it represents for my particular device and how the data I need is >> > > represented there. There are also these "if this is device v1 -> use >> > > v1 descriptor format; if it's a v2->use this another struct; etc" >> > > complexities that we'll be pushing onto the users. With kfuncs, we put >> > > this burden on the driver developers, but I agree that the drawback >> > > here is that we actually have to wait for the implementations to catch >> > > up. >> > >> > I agree with everything there, you will get a blob of data and then >> > will need to know what field you want to read using BTF. But, we >> > already do this for BPF programs all over the place so its not a big >> > lift for us. All other BPF tracing/observability requires the same >> > logic. I think users of BPF in general perhaps XDP/tc are the only >> > place left to write BPF programs without thinking about BTF and >> > kernel data structures. >> > >> > But, with proposed kptr the complexity lives in userspace and can be >> > fixed, added, updated without having to bother with kernel updates, etc. >> > From my point of view of supporting Cilium its a win and much preferred >> > to having to deal with driver owners on all cloud vendors, distributions, >> > and so on. >> > >> > If vendor updates firmware with new fields I get those immediately. >> >> Conversely it's a valid concern that those who *do* actually update >> their kernel regularly will have more things to worry about. >> >> > > Jakub mentions FW and I haven't even thought about that; so yeah, bpf >> > > programs might have to take a lot of other state into consideration >> > > when parsing the descriptors; all those details do seem like they >> > > belong to the driver code. >> > >> > I would prefer to avoid being stuck on requiring driver writers to >> > be involved. With just a kptr I can support the device and any >> > firwmare versions without requiring help. 
>> >> 1) where are you getting all those HW / FW specs :S >> 2) maybe *you* can but you're not exactly not an ex-driver developer :S >> >> > > Feel free to send it early with just a handful of drivers implemented; >> > > I'm more interested about bpf/af_xdp/user api story; if we have some >> > > nice sample/test case that shows how the metadata can be used, that >> > > might push us closer to the agreement on the best way to proceed. >> > >> > I'll try to do a intel and mlx implementation to get a cross section. >> > I have a good collection of nics here so should be able to show a >> > couple firmware versions. It could be fine I think to have the raw >> > kptr access and then also kfuncs for some things perhaps. >> > >> > > > I'd prefer if we left the door open for new vendors. Punting descriptor >> > > > parsing to user space will indeed result in what you just said - major >> > > > vendors are supported and that's it. >> > >> > I'm not sure about why it would make it harder for new vendors? I think >> > the opposite, >> >> TBH I'm only replying to the email because of the above part :) >> I thought this would be self evident, but I guess our perspectives >> are different. >> >> Perhaps you look at it from the perspective of SW running on someone >> else's cloud, an being able to move to another cloud, without having >> to worry if feature X is available in xdp or just skb. >> >> I look at it from the perspective of maintaining a cloud, with people >> writing random XDP applications. If I swap a NIC from an incumbent to a >> (superior) startup, and cloud users are messing with raw descriptor - >> I'd need to go find every XDP program out there and make sure it >> understands the new descriptors. > > Here is another perspective: > > As AF_XDP application developer I don't wan't to deal with the > underlying hardware in detail. I like to request a feature from the OS > (in this case rx/tx timestamping). If the feature is available I will > simply use it, if not I might have to work around it - maybe by falling > back to SW timestamping. > > All parts of my application (BPF program included) should not be > optimized/adjusted for all the different HW variants out there. Yes, absolutely agreed. Abstracting away those kinds of hardware differences is the whole *point* of having an OS/driver model. I.e., it's what the kernel is there for! If people want to bypass that and get direct access to the hardware, they can already do that by using DPDK. So in other words, 100% agreed that we should not expect the BPF developers to deal with hardware details as would be required with a kptr-based interface. As for the kfunc-based interface, I think it shows some promise. Exposing a list of function names to retrieve individual metadata items instead of a struct layout is sorta comparable in terms of developer UI accessibility etc (IMO). There are three main drawbacks, AFAICT: 1. It requires driver developers to write and maintain the code that generates the unrolled BPF bytecode to access the metadata fields, which is a non-trivial amount of complexity. Maybe this can be abstracted away with some internal helpers though (like, e.g., a bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out the required JMP/MOV/LDX instructions? 2. AF_XDP programs won't be able to access the metadata without using a custom XDP program that calls the kfuncs and puts the data into the metadata area. 
We could solve this with some code in libxdp, though; if this code can be made generic enough (so it just dumps the available metadata functions from the running kernel at load time), it may be possible to make it generic enough that it will be forward-compatible with new versions of the kernel that add new fields, which should alleviate Florian's concern about keeping things in sync. 3. It will make it harder to consume the metadata when building SKBs. I think the CPUMAP and veth use cases are also quite important, and that we want metadata to be available for building SKBs in this path. Maybe this can be resolved by having a convenient kfunc for this that can be used for programs doing such redirects. E.g., you could just call xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that would recursively expand into all the kfunc calls needed to extract the metadata supported by the SKB path? -Toke
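Picking up the forward reference in drawback 1: a hedged sketch of what
a driver's unroll hook might emit for an rx-timestamp kfunc, roughly in
the spirit of the RFC's ndo_unroll_kfunc. BPF_LDX_MEM() is the real
insn-building macro from include/linux/filter.h; the function name, the
descriptor struct, and the assumption that R1 holds the descriptor
pointer are all made up for illustration:

    /* Hypothetical driver-side unroll: replace the kfunc call with a
     * single direct load of the timestamp out of the rx descriptor.
     */
    static int mydrv_unroll_rx_timestamp(struct bpf_insn *insn)
    {
            /* r0 = *(u64 *)(r1 + offsetof(struct mydrv_rx_desc, ts)) */
            insn[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
                                  offsetof(struct mydrv_rx_desc, ts));
            return 1; /* number of instructions emitted */
    }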
On Mon, Oct 31, 2022 at 8:28 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
> [...]
>
> There are three main drawbacks, AFAICT:
>
> 1. It requires driver developers to write and maintain the code that
> generates the unrolled BPF bytecode to access the metadata fields,
> which is a non-trivial amount of complexity. Maybe this can be
> abstracted away with some internal helpers, though (like, e.g., a
> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
> the required JMP/MOV/LDX instructions)?

Right, I hope we can have some helpers to abstract the raw
instructions. I might need to try to implement the actual metadata
fetching for some real devices and see how well it works in practice.

> 2. AF_XDP programs won't be able to access the metadata without using a
> custom XDP program that calls the kfuncs and puts the data into the
> metadata area. We could solve this with some code in libxdp, though;
> [...]

Good point. I had to convert to a custom program to use the kfuncs :-(
But your suggestion sounds good; maybe libxdp can accept some extra
info about the offset at which the user would like to place the
metadata, and the library can generate the required bytecode?

> 3. It will make it harder to consume the metadata when building SKBs. I
> think the CPUMAP and veth use cases are also quite important, and that
> we want metadata to be available for building SKBs in this path. Maybe
> this can be resolved by having a convenient kfunc for this that can be
> used by programs doing such redirects. E.g., you could just call
> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> would recursively expand into all the kfunc calls needed to extract the
> metadata supported by the SKB path?

So this xdp_copy_metadata_for_skb will create a metadata layout that
the kernel will be able to understand when converting back to an skb?
IIUC, the xdp program will look something like the following:

  if (xdp packet is to be consumed by af_xdp) {
          /* do a bunch of bpf_xdp_metadata_<metadata> calls and
           * assemble your own metadata layout
           */
          return bpf_redirect_map(xsk, ...);
  } else {
          /* if the packet is to be consumed by the kernel */
          xdp_copy_metadata_for_skb(ctx);
          return bpf_redirect(...);
  }

Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe put
some magic number in the first byte(s) of the metadata so the kernel
can check whether xdp_copy_metadata_for_skb has been called previously
(or maybe xdp_frame can carry this extra signal, idk).
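Filling in Stan's pseudocode as a self-contained sketch, under stated
assumptions: bpf_xdp_metadata_rx_timestamp() stands in for whatever the
rx-timestamp kfunc ends up being called, xdp_copy_metadata_for_skb() is
Toke's proposed (not yet existing) kfunc, and the metadata layout (one
u64 in front of the packet) is simply something this program and its
AF_XDP consumer agree on:

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    /* Assumed kfuncs; neither signature is final. */
    extern __u64 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
    extern void xdp_copy_metadata_for_skb(struct xdp_md *ctx) __ksym;

    struct {
            __uint(type, BPF_MAP_TYPE_XSKMAP);
            __uint(max_entries, 64);
            __uint(key_size, sizeof(__u32));
            __uint(value_size, sizeof(__u32));
    } xsk SEC(".maps");

    /* App-private metadata layout shared with the AF_XDP consumer. */
    struct my_meta {
            __u64 rx_timestamp;
    };

    SEC("xdp")
    int rx(struct xdp_md *ctx)
    {
            struct my_meta *meta;

            if (ctx->rx_queue_index == 0) { /* this queue goes to AF_XDP */
                    /* reserve headroom in front of the packet */
                    if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                            return XDP_PASS;
                    meta = (void *)(long)ctx->data_meta;
                    if ((void *)(meta + 1) > (void *)(long)ctx->data)
                            return XDP_PASS;
                    meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
                    return bpf_redirect_map(&xsk, ctx->rx_queue_index,
                                            XDP_PASS);
            }

            /* packet is for the kernel: proposed kfunc, hypothetical */
            xdp_copy_metadata_for_skb(ctx);
            return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";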
Jakub Kicinski wrote:
> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> > I don't see how it's any harder in practice though?
>
> You need to find out what HW/FW/config you're running, right?
> And all you have is a pointer to a blob of unknown type.

Yep. I guess I'm in the position of already having to do this somewhat
to collect stats from the device, although it's maybe a bit more
involved here with vendors that are versioning the descriptors. Also,
nit: it's not an unknown type, we know the full type from BTF.

> Take timestamps for example: some NICs support adjusting the PHC
> or doing SW corrections (with different versions of hw/fw/server
> platforms being capable of both/one/neither).

It's worse, actually. Having started to do this timestamping, it is not
at all consistent across nics, so I think we are stuck having to know
hw specifics here regardless. Also, some nics will timestamp all RX
pkts, some specific pkts, some require configuration to decide which
mode to run in, and so on. You then end up with a matrix of features
supported by hw/fw/sw and desired state, and I can't see any way around
this.

> Sure, you can extract all this info with tracing and careful
> inspection via uAPI. But I don't think that's _easier_.
> And the vendors can't run the results thru their validation
> (for whatever that's worth).

I think you hit our point-of-view differences below. See, I don't want
to depend on the vendor. I want access to the fields, otherwise I'm
stuck working with vendors on their time frames. You have the other
perspective of supporting the NIC and the ability to update kernels,
whereas I still live with 4.18/4.19 kernels (even 4.14 sometimes). So
what we land now needs to still work in 5 years.

> > I agree with everything there: you will get a blob of data and then
> > will need to know what field you want to read using BTF. [...]
> >
> > If a vendor updates firmware with new fields I get those immediately.
>
> Conversely, it's a valid concern that those who *do* actually update
> their kernel regularly will have more things to worry about.

I'm not sure if a kptr_func is any harder to write than a user space
relocation for that func? In tetragon and cilium we've done userspace
rewrites for some time. Happy to generalize that infra into the kernel
repo if that helps. IMO having a libhw.h in the kernel tree's
./tools/bpf directory would work.

> > I would prefer to avoid being stuck on requiring driver writers to be
> > involved. With just a kptr I can support the device and any firmware
> > versions without requiring help.
>
> 1) where are you getting all those HW / FW specs :S

Most are public docs; of course vendors have internal docs with more
details, but what can you do. Also, the source code has the structs.

> 2) maybe *you* can, but you're not exactly not an ex-driver developer :S

Sure :) but if we put a libhw.h file in the kernel and test it with
selftests (which will be hard without hardware), then not everyone
needs to be a driver-internals expert.

> > I'll try to do an intel and mlx implementation to get a cross section.
> > [...] It could be fine, I think, to have the raw kptr access and then
> > also kfuncs for some things perhaps.
> >
> > I'm not sure why it would make it harder for new vendors? I think the
> > opposite,
>
> TBH I'm only replying to the email because of the above part :)
> I thought this would be self-evident, but I guess our perspectives
> are different.

Yep.

> Perhaps you look at it from the perspective of SW running on someone
> else's cloud, and being able to move to another cloud, without having
> to worry if feature X is available in xdp or just skb.

Exactly. I have SW running in a data center or cloud for a security
team or ops team, and they usually don't own the platform. Anyway, the
platform team is most likely going to stay on an LTS kernel for at
least a year or two. I maintain the SW, and some 3rd-party nic vendor
may not even know about my SW (cilium/tetragon).

> I look at it from the perspective of maintaining a cloud, with people
> writing random XDP applications. If I swap a NIC from an incumbent to a
> (superior) startup, and cloud users are messing with the raw
> descriptor, I'd need to go find every XDP program out there and make
> sure it understands the new descriptors.

I get it. It's interesting that you wouldn't tell the XDP programmers
to deal with it, which is my case. My $.02 is that a userspace lib
could abstract this more easily than a kernel func, and also add new
features without rolling new kernels.

> There is a BPF foundation or whatnot now - what about starting a
> certification program for cloud providers and making it clear what
> features must be supported to be compatible with XDP 1.0, XDP 2.0 etc?

Maybe, but we're still stuck on kernel versions.

> > it would be easier because I don't need vendor support at all.
>
> Can you support the enfabrica NIC on day 1? :) To an extent, it's just
> shifting the responsibility from the HW vendor to the middleware vendor.

Yep. With the important detail that I can run new features on old
kernels even if the vendor didn't add rxhash or timestamp out of the
gate. Nothing stops the hw vendor from contributing to this in-kernel
bpf lib with the features, though.

> > Thinking it over seems there could be room for both.
>
> Are you thinking more or less Stan's proposal but with one of the
> callbacks being "give me the raw thing"? Probably as a ro dynptr?
> Possible, but I don't think we need to hold off Stan's work.

Yeah, that was my thinking. Both could coexist. OTOH, doing it on the
BPF program/lib side seems cleaner to me from a kernel maintenance and
hardware support perspective. We've had trouble getting drivers to
support XDP features, so adding more requirements for entry seems
problematic to me when we can avoid it.
On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote: > "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes: > >> Hi all, >> >> I was closely following this discussion for some time now. Seems we >> reached the point where it's getting interesting for me. >> >> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote: >>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote: >>>>>> And it's actually harder to abstract away inter HW generation >>>>>> differences if the user space code has to handle all of it. >>>> >>>> I don't see how its any harder in practice though? >>> >>> You need to find out what HW/FW/config you're running, right? >>> And all you have is a pointer to a blob of unknown type. >>> >>> Take timestamps for example, some NICs support adjusting the PHC >>> or doing SW corrections (with different versions of hw/fw/server >>> platforms being capable of both/one/neither). >>> >>> Sure you can extract all this info with tracing and careful >>> inspection via uAPI. But I don't think that's _easier_. >>> And the vendors can't run the results thru their validation >>> (for whatever that's worth). >>> >>>>> I've had the same concern: >>>>> >>>>> Until we have some userspace library that abstracts all these details, >>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob >>>>> of data and I need to go through the code and see what particular type >>>>> it represents for my particular device and how the data I need is >>>>> represented there. There are also these "if this is device v1 -> use >>>>> v1 descriptor format; if it's a v2->use this another struct; etc" >>>>> complexities that we'll be pushing onto the users. With kfuncs, we put >>>>> this burden on the driver developers, but I agree that the drawback >>>>> here is that we actually have to wait for the implementations to catch >>>>> up. >>>> >>>> I agree with everything there, you will get a blob of data and then >>>> will need to know what field you want to read using BTF. But, we >>>> already do this for BPF programs all over the place so its not a big >>>> lift for us. All other BPF tracing/observability requires the same >>>> logic. I think users of BPF in general perhaps XDP/tc are the only >>>> place left to write BPF programs without thinking about BTF and >>>> kernel data structures. >>>> >>>> But, with proposed kptr the complexity lives in userspace and can be >>>> fixed, added, updated without having to bother with kernel updates, etc. >>>> From my point of view of supporting Cilium its a win and much preferred >>>> to having to deal with driver owners on all cloud vendors, distributions, >>>> and so on. >>>> >>>> If vendor updates firmware with new fields I get those immediately. >>> >>> Conversely it's a valid concern that those who *do* actually update >>> their kernel regularly will have more things to worry about. >>> >>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf >>>>> programs might have to take a lot of other state into consideration >>>>> when parsing the descriptors; all those details do seem like they >>>>> belong to the driver code. >>>> >>>> I would prefer to avoid being stuck on requiring driver writers to >>>> be involved. With just a kptr I can support the device and any >>>> firwmare versions without requiring help. 
>>> >>> 1) where are you getting all those HW / FW specs :S >>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S >>> >>>>> Feel free to send it early with just a handful of drivers implemented; >>>>> I'm more interested about bpf/af_xdp/user api story; if we have some >>>>> nice sample/test case that shows how the metadata can be used, that >>>>> might push us closer to the agreement on the best way to proceed. >>>> >>>> I'll try to do a intel and mlx implementation to get a cross section. >>>> I have a good collection of nics here so should be able to show a >>>> couple firmware versions. It could be fine I think to have the raw >>>> kptr access and then also kfuncs for some things perhaps. >>>> >>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor >>>>>> parsing to user space will indeed result in what you just said - major >>>>>> vendors are supported and that's it. >>>> >>>> I'm not sure about why it would make it harder for new vendors? I think >>>> the opposite, >>> >>> TBH I'm only replying to the email because of the above part :) >>> I thought this would be self evident, but I guess our perspectives >>> are different. >>> >>> Perhaps you look at it from the perspective of SW running on someone >>> else's cloud, an being able to move to another cloud, without having >>> to worry if feature X is available in xdp or just skb. >>> >>> I look at it from the perspective of maintaining a cloud, with people >>> writing random XDP applications. If I swap a NIC from an incumbent to a >>> (superior) startup, and cloud users are messing with raw descriptor - >>> I'd need to go find every XDP program out there and make sure it >>> understands the new descriptors. >> >> Here is another perspective: >> >> As AF_XDP application developer I don't wan't to deal with the >> underlying hardware in detail. I like to request a feature from the OS >> (in this case rx/tx timestamping). If the feature is available I will >> simply use it, if not I might have to work around it - maybe by falling >> back to SW timestamping. >> >> All parts of my application (BPF program included) should not be >> optimized/adjusted for all the different HW variants out there. > > Yes, absolutely agreed. Abstracting away those kinds of hardware > differences is the whole *point* of having an OS/driver model. I.e., > it's what the kernel is there for! If people want to bypass that and get > direct access to the hardware, they can already do that by using DPDK. > > So in other words, 100% agreed that we should not expect the BPF > developers to deal with hardware details as would be required with a > kptr-based interface. > > As for the kfunc-based interface, I think it shows some promise. > Exposing a list of function names to retrieve individual metadata items > instead of a struct layout is sorta comparable in terms of developer UI > accessibility etc (IMO). Looks like there are quite some use cases for hw_timestamp. Do you think we could add it to the uapi like struct xdp_md? The following is the current xdp_md: struct xdp_md { __u32 data; __u32 data_end; __u32 data_meta; /* Below access go through struct xdp_rxq_info */ __u32 ingress_ifindex; /* rxq->dev->ifindex */ __u32 rx_queue_index; /* rxq->queue_index */ __u32 egress_ifindex; /* txq->dev->ifindex */ }; We could add __u64 hw_timestamp to the xdp_md so user can just do xdp_md->hw_timestamp to get the value. xdp_md->hw_timestamp == 0 means hw_timestamp is not available. 
Inside the kernel, the ctx rewriter can generate code to call a driver-specific function to retrieve the data. The kfunc approach can be used for the *less* common use cases? > > There are three main drawbacks, AFAICT: > > 1. It requires driver developers to write and maintain the code that > generates the unrolled BPF bytecode to access the metadata fields, which > is a non-trivial amount of complexity. Maybe this can be abstracted away > with some internal helpers, though (like, e.g., a > bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out > the required JMP/MOV/LDX instructions)? > > 2. AF_XDP programs won't be able to access the metadata without using a > custom XDP program that calls the kfuncs and puts the data into the > metadata area. We could solve this with some code in libxdp, though; if > this code can be made generic enough (so it just dumps the available > metadata functions from the running kernel at load time), it may be > possible to make it forward-compatible > with new versions of the kernel that add new fields, which should > alleviate Florian's concern about keeping things in sync. > > 3. It will make it harder to consume the metadata when building SKBs. I > think the CPUMAP and veth use cases are also quite important, and > we want metadata to be available for building SKBs in this path. Maybe > this can be resolved by having a convenient kfunc for this that can be > used for programs doing such redirects. E.g., you could just call > xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that > would recursively expand into all the kfunc calls needed to extract the > metadata supported by the SKB path? > > -Toke
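To make the proposal concrete, here is a minimal sketch of the program-side usage, assuming the hypothetical hw_timestamp field were added (it is not part of struct xdp_md today):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int rx_prog(struct xdp_md *ctx)
{
	/* hw_timestamp is the hypothetical field proposed above; it
	 * does not exist in the current uapi. 0 means "not available". */
	__u64 ts = ctx->hw_timestamp;

	if (ts)
		bpf_printk("hw rx timestamp: %llu", ts);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

At load time, the verifier's ctx rewriter would convert the field access into a call into the driver, the same way existing xdp_md fields are rewritten into xdp_buff/xdp_rxq_info accesses.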
On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote: > [...] > We could add __u64 hw_timestamp to the xdp_md so users > can just do xdp_md->hw_timestamp to get the value. > xdp_md->hw_timestamp == 0 means hw_timestamp is not > available. > > Inside the kernel, the ctx rewriter can generate code > to call a driver-specific function to retrieve the data. If the driver generates the code to retrieve the data, how's that different from the kfunc approach? The only difference I see is that it would be a stronger UAPI than the kfuncs? > The kfunc approach can be used for the *less* common use cases? What's the advantage of having two approaches when one can cover common and uncommon cases? > [...]
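For comparison, the kfunc variant under discussion would look roughly like the sketch below (the extern/__ksym declaration style is standard; the exact kfunc name and signature are still being worked out in this thread, so treat them as placeholders):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Placeholder kfunc declaration; the call would be unrolled/inlined
 * into driver-specific instructions at program load time. */
extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx_prog(struct xdp_md *ctx)
{
	__u64 ts = bpf_xdp_metadata_rx_timestamp(ctx); /* 0 = not available */

	if (ts)
		bpf_printk("hw rx timestamp: %llu", ts);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

In both variants the instructions the program ends up executing can be identical; the difference is whether the contract is a uapi struct field or a kfunc whose availability may vary per device and kernel.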
On 10/31/22 3:09 PM, Stanislav Fomichev wrote: > On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote: >> [...] >> Inside the kernel, the ctx rewriter can generate code >> to call a driver-specific function to retrieve the data. > > If the driver generates the code to retrieve the data, how's that > different from the kfunc approach? > The only difference I see is that it would be a stronger UAPI than > the kfuncs? Right, it is a strong uapi. >> The kfunc approach can be used for the *less* common use cases? > > What's the advantage of having two approaches when one can cover > common and uncommon cases? Beyond hw_timestamp, do we have any other fields ready to support? If it ends up with lots of fields to be accessed by the bpf program, and the bpf program actually intends to access these fields, using a strong uapi might be a good thing, as it can make the code much more streamlined. > [...]
On Mon, Oct 31, 2022 at 3:38 PM Yonghong Song <yhs@meta.com> wrote: > [...] > Beyond hw_timestamp, do we have any other fields ready to support? > > If it ends up with lots of fields to be accessed by the bpf program, > and the bpf program actually intends to access these fields, > using a strong uapi might be a good thing, as it can make the code > much more streamlined. There are a bunch. Alexander's series has a good list: https://github.com/alobakin/linux/commit/31bfe8035c995fdf4f1e378b3429d24b96846cc8 We can definitely call some of them more "common" than the others, but I'm not sure how strong of a definition that would be. > [...]
On 10/31/22 10:00 AM, Stanislav Fomichev wrote: >> 2. AF_XDP programs won't be able to access the metadata without using a >> custom XDP program that calls the kfuncs and puts the data into the >> metadata area. We could solve this with some code in libxdp, though; if >> this code can be made generic enough (so it just dumps the available >> metadata functions from the running kernel at load time), it may be >> possible to make it forward-compatible >> with new versions of the kernel that add new fields, which should >> alleviate Florian's concern about keeping things in sync. > > Good point. I had to convert to a custom program to use the kfuncs :-( > But your suggestion sounds good; maybe libxdp can accept some extra > info about at which offset the user would like to place the metadata > and the library can generate the required bytecode? > >> 3. It will make it harder to consume the metadata when building SKBs. I >> think the CPUMAP and veth use cases are also quite important, and >> we want metadata to be available for building SKBs in this path. Maybe >> this can be resolved by having a convenient kfunc for this that can be >> used for programs doing such redirects. E.g., you could just call >> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that >> would recursively expand into all the kfunc calls needed to extract the >> metadata supported by the SKB path? > > So this xdp_copy_metadata_for_skb will create a metadata layout that Can the xdp_copy_metadata_for_skb be written as a bpf prog itself? Not sure where the best point to specify this prog is, though. Somehow during bpf_xdp_redirect_map? Or does this prog belong to the target cpumap, so that the xdp prog redirecting to this cpumap has to write the meta layout in a way that the cpumap is expecting? > the kernel will be able to understand when converting back to skb? > IIUC, the xdp program will look something like the following: >
> if (xdp packet is to be consumed by af_xdp) {
>         // do a bunch of bpf_xdp_metadata_<metadata> calls and
>         // assemble your own metadata layout
>         return bpf_redirect_map(xsk, ...);
> } else {
>         // if the packet is to be consumed by the kernel
>         xdp_copy_metadata_for_skb(ctx);
>         return bpf_redirect(...);
> }
>
> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe > put some magic number in the first byte(s) of the metadata so the > kernel can check whether xdp_copy_metadata_for_skb has been called > previously (or maybe xdp_frame can carry this extra signal, idk).
On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: > [...] > Can the xdp_copy_metadata_for_skb be written as a bpf prog itself? > Not sure where the best point to specify this prog is, though. Somehow during > bpf_xdp_redirect_map? > Or does this prog belong to the target cpumap, so that the xdp prog redirecting to this > cpumap has to write the meta layout in a way that the cpumap is expecting? We're probably interested in triggering it from the places where xdp frames can eventually be converted into skbs? So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW, anything that's not XDP_DROP / AF_XDP redirect). We can probably make it magically work, and can generate kernel-digestible metadata whenever data == data_meta, but the question is - should we? (need to make sure we won't regress any existing cases that are not relying on the metadata) > [...]
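Filling in the pseudocode quoted above as a fuller sketch: xdp_copy_metadata_for_skb() and bpf_xdp_metadata_rx_timestamp() are hypothetical kfuncs from this discussion, and the metadata layout, map sizing and target ifindex are made up for illustration:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical kfuncs proposed in this thread; neither exists yet. */
extern __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;
extern void xdp_copy_metadata_for_skb(struct xdp_md *ctx) __ksym;

/* Private layout agreed between this program and the AF_XDP consumer. */
struct meta {
	__u64 rx_timestamp;
};

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} xsk SEC(".maps");

SEC("xdp")
int rx(struct xdp_md *ctx)
{
	__u32 qid = ctx->rx_queue_index;
	struct meta *m;

	if (bpf_map_lookup_elem(&xsk, &qid)) {
		/* AF_XDP path: assemble our own metadata layout. */
		if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*m)))
			return XDP_DROP;
		m = (void *)(long)ctx->data_meta;
		if ((void *)(m + 1) > (void *)(long)ctx->data)
			return XDP_DROP;
		m->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
		return bpf_redirect_map(&xsk, qid, XDP_PASS);
	}

	/* Kernel path: stash metadata in the layout the SKB build
	 * path understands. */
	xdp_copy_metadata_for_skb(ctx);
	return bpf_redirect(2 /* made-up target ifindex */, 0);
}

char _license[] SEC("license") = "GPL";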
Stanislav Fomichev <sdf@google.com> writes: > [...] > We're probably interested in triggering it from the places where xdp > frames can eventually be converted into skbs? > So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW, > anything that's not XDP_DROP / AF_XDP redirect). > We can probably make it magically work, and can generate > kernel-digestible metadata whenever data == data_meta, but the > question is - should we? > (need to make sure we won't regress any existing cases that are not > relying on the metadata) So I was thinking about whether we could have the kernel do this automatically, and concluded that this was probably not feasible in general, which is why I suggested the explicit helper. My reasoning was as follows: For straight XDP_PASS in the driver we don't actually need to do anything today, as the driver itself will build the SKB and read any metadata it needs from the HW descriptor[0]. This leaves packets that are redirected (either to a veth or a cpumap so we build SKBs from them later); here the problem is that we buffer the packets (for performance reasons) so that the redirect doesn't actually happen until after the driver exits the NAPI loop. At which point we don't have access to the HW descriptors anymore, so we can't actually read the metadata. This means that if we want to execute the metadata gathering automatically, we'd have to do it in xdp_do_redirect(). Which means that we'll have to figure out, at that point, whether the XDP frame is likely to be converted to an SKB. This will add at least one branch (and probably more) that will be in-path for every redirected frame. Hence, making it up to the XDP program itself to decide whether it will need the metadata for SKB conversion seems like a better choice, as long as we make it easy for the XDP program to do this. Instead of a helper, this could also simply be a new flag to the bpf_redirect{,_map}() helpers (either opt-in or opt-out depending on the overhead), which would be even simpler? I.e.,

return bpf_redirect_map(&cpumap, 0, BPF_F_PREPARE_SKB_METADATA);

-Toke

[0] As an aside, in the future drivers may want to take advantage of the XDP-specific metadata reading also when building SKBs (so they don't have to implement it in both BPF and C code). For this, we could expose a new internal helper function that the drivers could call to simply execute the XDP-to-skb metadata helpers the same way the stack/helper does.
On 11/1/22 6:52 AM, Toke Høiland-Jørgensen wrote: > Stanislav Fomichev <sdf@google.com> writes: > [...] > So I was thinking about whether we could have the kernel do this > automatically, and concluded that this was probably not feasible in > general, which is why I suggested the explicit helper. My reasoning was > as follows: > > For straight XDP_PASS in the driver we don't actually need to do > anything today, as the driver itself will build the SKB and read any > metadata it needs from the HW descriptor[0]. The program can pop encap headers, mpls tags, ... and thus affect the metadata in the descriptor (besides the timestamp). > > This leaves packets that are redirected (either to a veth or a cpumap so > we build SKBs from them later); here the problem is that we buffer the > packets (for performance reasons) so that the redirect doesn't actually > happen until after the driver exits the NAPI loop. At which point we > don't have access to the HW descriptors anymore, so we can't actually > read the metadata. > > This means that if we want to execute the metadata gathering > automatically, we'd have to do it in xdp_do_redirect(). Which means that > we'll have to figure out, at that point, whether the XDP frame is likely > to be converted to an SKB. This will add at least one branch (and > probably more) that will be in-path for every redirected frame. Or forwarded to a tun device as an xdp frame, wanting to pass metadata into a VM which may construct an skb in the guest. This case is arguably aligned with the redirect from vendor1 to vendor2. This thread (and others) seem to be focused on the Rx path, but the Tx path is equally important, with similar needs.
David Ahern <dsahern@gmail.com> writes: > On 11/1/22 6:52 AM, Toke Høiland-Jørgensen wrote: > [...] >> For straight XDP_PASS in the driver we don't actually need to do >> anything today, as the driver itself will build the SKB and read any >> metadata it needs from the HW descriptor[0]. > > The program can pop encap headers, mpls tags, ... and thus affect the > metadata in the descriptor (besides the timestamp). Hmm, right, good point. How does XDP_PASS deal with that today, though? I guess this is an argument for making the "read HW metadata into SKB format" thing be a kfunc/helper rather than a flag to bpf_redirect(), then.
Because then we can allow the XDP program to override/modify the metadata afterwards, either by defining it as:

int xdp_copy_metadata_for_skb(struct xdp_md *ctx,
                              struct xdp_skb_meta *override,
                              int flags);

where the XDP program can fill in 'override' with new data that takes precedence over the stuff from the HW (like a modified checksum or offset or something). Or we can just have xdp_copy_metadata_for_skb() write into the regular XDP metadata area, and let the XDP program modify it afterwards. I feel like the override argument would be easier to use, though. Also, having it be completely opaque *where* the metadata is stored when using xdp_copy_metadata_for_skb() lets us be more flexible about it. E.g., the helper could write the timestamp directly into skb_shared_info, instead of stuffing it into the metadata area where it then has to be copied out later. > [...] > This thread (and others) seem to be focused on the Rx path, but the Tx > path is equally important with similar needs. You're right, of course. Thinking a bit out loud here, but I actually think the kfunc approach makes the TX side easier: We already have the ability to execute a second "TX" XDP program inside the devmaps, at which point that program is also tied to a particular interface. So we could duplicate the RX-side kfunc trick, and expose a set of *writer* kfuncs for metadata, so that an XDP program in the devmap can simply do:

if (bpf_xdp_metadata_tx_timestamp_supported())
        bpf_xdp_metadata_tx_timestamp(ctx, tsval);

and those two kfuncs will be unrolled by the TX-side driver as well, to store the values wherever they need to go to reach the wire. The one complication here being, of course, that by the time the devmap XDP program is executed, the driver hasn't seen the frame at all yet, so it doesn't have anywhere to store that data. We'd need to reuse the frame metadata area for this (with some flag indicating that it's valid), or we'd need a new area the driver could use as scratch space specific to the xdp_frame (like the skb->cb field, I suppose). -Toke
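A sketch of how such a devmap program might look; the attach mechanism (a program fd in the devmap value) already exists, but both writer kfuncs are purely hypothetical here, mirroring the snippet above:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical TX-side writer kfuncs; they do not exist today. */
extern int bpf_xdp_metadata_tx_timestamp_supported(void) __ksym;
extern void bpf_xdp_metadata_tx_timestamp(struct xdp_md *ctx, __u64 tsval) __ksym;

SEC("xdp/devmap")
int tx_meta(struct xdp_md *ctx)
{
	/* The value could equally come from the frame's metadata area,
	 * where the RX-side program stashed it. */
	__u64 tsval = bpf_ktime_get_ns();

	if (bpf_xdp_metadata_tx_timestamp_supported())
		bpf_xdp_metadata_tx_timestamp(ctx, tsval);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

Per the complication noted above, whatever the program writes here would have to live in (or alongside) the xdp_frame until the driver actually builds the TX descriptor.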
On 31/10/2022 23.55, Stanislav Fomichev wrote: > On Mon, Oct 31, 2022 at 3:38 PM Yonghong Song <yhs@meta.com> wrote: >> [...] >> Beyond hw_timestamp, do we have any other fields ready to support? >> >> If it ends up with lots of fields to be accessed by the bpf program, >> and the bpf program actually intends to access these fields, >> using a strong uapi might be a good thing, as it can make the code >> much more streamlined. > > There are a bunch. Alexander's series has a good list: > > https://github.com/alobakin/linux/commit/31bfe8035c995fdf4f1e378b3429d24b96846cc8 Below are the fields I've identified, which are close to what Alexander also found.
struct xdp_hints_common {
	union {
		__wsum csum;
		struct {
			__u16 csum_start;
			__u16 csum_offset;
		};
	};
	u16 rx_queue;
	u16 vlan_tci;
	u32 rx_hash32;
	u32 xdp_hints_flags;
	u64 btf_full_id; /* BTF object + type ID */
} __attribute__((aligned(4))) __attribute__((packed));

Some of the fields are encoded via flags:

enum xdp_hints_flags {
	HINT_FLAG_CSUM_TYPE_BIT0     = BIT(0),
	HINT_FLAG_CSUM_TYPE_BIT1     = BIT(1),
	HINT_FLAG_CSUM_TYPE_MASK     = 0x3,

	HINT_FLAG_CSUM_LEVEL_BIT0    = BIT(2),
	HINT_FLAG_CSUM_LEVEL_BIT1    = BIT(3),
	HINT_FLAG_CSUM_LEVEL_MASK    = 0xC,
	HINT_FLAG_CSUM_LEVEL_SHIFT   = 2,

	HINT_FLAG_RX_HASH_TYPE_BIT0  = BIT(4),
	HINT_FLAG_RX_HASH_TYPE_BIT1  = BIT(5),
	HINT_FLAG_RX_HASH_TYPE_MASK  = 0x30,
	HINT_FLAG_RX_HASH_TYPE_SHIFT = 0x4,

	HINT_FLAG_RX_QUEUE = BIT(7),

	HINT_FLAG_VLAN_PRESENT            = BIT(8),
	HINT_FLAG_VLAN_PROTO_ETH_P_8021Q  = BIT(9),
	HINT_FLAG_VLAN_PROTO_ETH_P_8021AD = BIT(10),

	/* Flags from BIT(16) can be used by drivers */
};

> We can definitely call some of them more "common" than the others, but > not sure how strong of a definition that would be. The important fields that would be worth considering as UAPI candidates are: (1) RX-hash, (2) Hash-type and (3) RX-checksum. With these three we can avoid calling the flow-dissector, and GRO frame aggregation works. (This currently hurts xdp_frame-to-SKB performance a lot in practice.) *BUT* in its current form above (incl. Alexander's approach/patch) it would be a mistake to standardize the "(2) Hash-type" as UAPI in this simplified "reduced" form (which is what the SKB "needs"). There is a huge untapped potential in the Hash-type. Thanks to Microsoft, almost all NIC hardware provides a Hash-type that gives us the L3-protocol (IPv4 or IPv6) and the L4-protocol (UDP or TCP, and sometimes SCTP), plus info on whether extension-headers are present. (Digging in datasheets, we can often also get the header-size.) Think about how many cycles an XDP BPF-prog can save on parsing protocol headers. I'm also hoping we can leverage this to allow SKBs created from an xdp_frame to have skb->transport_header and skb->network_header pre-populated (and skip some of these netstack layers). --Jesper p.s. in my patchset, I exposed the "raw" Hash-type bits from the descriptor in the hope this would evolve.
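To illustrate that untapped potential, a sketch of a program that keys off a full hardware Hash-type instead of parsing headers; both the kfunc and the type encoding below are invented for illustration and appear verbatim in no proposal:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Invented encoding, loosely modeled on the RSS hash-type bits. */
enum xdp_rss_hash_type {
	XDP_RSS_L3_IPV4 = 1 << 0,
	XDP_RSS_L3_IPV6 = 1 << 1,
	XDP_RSS_L4_TCP  = 1 << 2,
	XDP_RSS_L4_UDP  = 1 << 3,
};

/* Hypothetical kfunc returning the hash and its type. */
extern __u32 bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx,
				      __u32 *hash_type) __ksym;

SEC("xdp")
int classify(struct xdp_md *ctx)
{
	__u32 type = 0;
	__u32 hash = bpf_xdp_metadata_rx_hash(ctx, &type);

	/* The NIC already classified L3/L4 for us: no flow dissector
	 * and no header parsing needed for the common case. */
	if ((type & (XDP_RSS_L3_IPV4 | XDP_RSS_L4_TCP)) ==
	    (XDP_RSS_L3_IPV4 | XDP_RSS_L4_TCP))
		bpf_printk("tcp/ipv4 flow, hash 0x%x", hash);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";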
On 10/31/22 6:59 PM, Stanislav Fomichev wrote: > On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: >> >> On 10/31/22 10:00 AM, Stanislav Fomichev wrote: >>>> 2. AF_XDP programs won't be able to access the metadata without using a >>>> custom XDP program that calls the kfuncs and puts the data into the >>>> metadata area. We could solve this with some code in libxdp, though; if >>>> this code can be made generic enough (so it just dumps the available >>>> metadata functions from the running kernel at load time), it may be >>>> possible to make it generic enough that it will be forward-compatible >>>> with new versions of the kernel that add new fields, which should >>>> alleviate Florian's concern about keeping things in sync. >>> >>> Good point. I had to convert to a custom program to use the kfuncs :-( >>> But your suggestion sounds good; maybe libxdp can accept some extra >>> info about at which offset the user would like to place the metadata >>> and the library can generate the required bytecode? >>> >>>> 3. It will make it harder to consume the metadata when building SKBs. I >>>> think the CPUMAP and veth use cases are also quite important, and that >>>> we want metadata to be available for building SKBs in this path. Maybe >>>> this can be resolved by having a convenient kfunc for this that can be >>>> used for programs doing such redirects. E.g., you could just call >>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that >>>> would recursively expand into all the kfunc calls needed to extract the >>>> metadata supported by the SKB path? >>> >>> So this xdp_copy_metadata_for_skb will create a metadata layout that >> >> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself? >> Not sure where is the best point to specify this prog though. Somehow during >> bpf_xdp_redirect_map? >> or this prog belongs to the target cpumap and the xdp prog redirecting to this >> cpumap has to write the meta layout in a way that the cpumap is expecting? > > We're probably interested in triggering it from the places where xdp > frames can eventually be converted into skbs? > So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW, > anything that's not XDP_DROP / AF_XDP redirect). > We can probably make it magically work, and can generate > kernel-digestible metadata whenever data == data_meta, but the > question - should we? > (need to make sure we won't regress any existing cases that are not > relying on the metadata)

Instead of having some kernel-digestible meta data, how about calling another bpf prog to initialize the skb fields from the meta area after __xdp_build_skb_from_frame() in the cpumap, so run_xdp_set_skb_fields_from_metadata() may be a better name. The xdp_prog@rx sets the meta data and then redirects. If the xdp_prog@rx can also specify an xdp prog to initialize the skb fields from the meta area, then there is no need to have a kfunc to enforce a kernel-digestible layout. Not sure what is a good way to specify this xdp_prog though... >>> the kernel will be able to understand when converting back to skb?
>>> IIUC, the xdp program will look something like the following:
>>>
>>>     if (xdp packet is to be consumed by af_xdp) {
>>>         // do a bunch of bpf_xdp_metadata_<metadata> calls and
>>>         // assemble your own metadata layout
>>>         return bpf_redirect_map(xsk, ...);
>>>     } else {
>>>         // if the packet is to be consumed by the kernel
>>>         xdp_copy_metadata_for_skb(ctx);
>>>         return bpf_redirect(...);
>>>     }
>>>
>>> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe >>> put some magic number in the first byte(s) of the metadata so the >>> kernel can check whether xdp_copy_metadata_for_skb has been called >>> previously (or maybe xdp_frame can carry this extra signal, idk).
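Spelled out as a complete program, that branching might look like the sketch below. Both kfuncs are hypothetical at this point in the discussion, and the xskmap lookup is just one possible "destined for AF_XDP" predicate:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical kfuncs; names follow the thread, nothing is settled API */
extern __u64 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
extern void xdp_copy_metadata_for_skb(struct xdp_md *ctx) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} xsk SEC(".maps");

SEC("xdp")
int rx(struct xdp_md *ctx)
{
	__u32 q = ctx->rx_queue_index;
	__u64 *ts;

	if (bpf_map_lookup_elem(&xsk, &q)) {
		/* AF_XDP consumer: private layout, here just one timestamp */
		if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*ts)))
			return XDP_DROP;
		ts = (void *)(long)ctx->data_meta;
		if ((void *)(ts + 1) > (void *)(long)ctx->data)
			return XDP_DROP;
		*ts = bpf_xdp_metadata_rx_timestamp(ctx);
		return bpf_redirect_map(&xsk, q, XDP_PASS);
	}
	/* kernel consumer: fixed layout the SKB-build path understands */
	xdp_copy_metadata_for_skb(ctx);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";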
On 10/31/22 3:09 PM, Stanislav Fomichev wrote: > On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote: >> >> >> >> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote: >>> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes: >>> >>>> Hi all, >>>> >>>> I was closely following this discussion for some time now. Seems we >>>> reached the point where it's getting interesting for me. >>>> >>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote: >>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote: >>>>>>>> And it's actually harder to abstract away inter HW generation >>>>>>>> differences if the user space code has to handle all of it. >>>>>> >>>>>> I don't see how its any harder in practice though? >>>>> >>>>> You need to find out what HW/FW/config you're running, right? >>>>> And all you have is a pointer to a blob of unknown type. >>>>> >>>>> Take timestamps for example, some NICs support adjusting the PHC >>>>> or doing SW corrections (with different versions of hw/fw/server >>>>> platforms being capable of both/one/neither). >>>>> >>>>> Sure you can extract all this info with tracing and careful >>>>> inspection via uAPI. But I don't think that's _easier_. >>>>> And the vendors can't run the results thru their validation >>>>> (for whatever that's worth). >>>>> >>>>>>> I've had the same concern: >>>>>>> >>>>>>> Until we have some userspace library that abstracts all these details, >>>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob >>>>>>> of data and I need to go through the code and see what particular type >>>>>>> it represents for my particular device and how the data I need is >>>>>>> represented there. There are also these "if this is device v1 -> use >>>>>>> v1 descriptor format; if it's a v2->use this another struct; etc" >>>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put >>>>>>> this burden on the driver developers, but I agree that the drawback >>>>>>> here is that we actually have to wait for the implementations to catch >>>>>>> up. >>>>>> >>>>>> I agree with everything there, you will get a blob of data and then >>>>>> will need to know what field you want to read using BTF. But, we >>>>>> already do this for BPF programs all over the place so its not a big >>>>>> lift for us. All other BPF tracing/observability requires the same >>>>>> logic. I think users of BPF in general perhaps XDP/tc are the only >>>>>> place left to write BPF programs without thinking about BTF and >>>>>> kernel data structures. >>>>>> >>>>>> But, with proposed kptr the complexity lives in userspace and can be >>>>>> fixed, added, updated without having to bother with kernel updates, etc. >>>>>> From my point of view of supporting Cilium its a win and much preferred >>>>>> to having to deal with driver owners on all cloud vendors, distributions, >>>>>> and so on. >>>>>> >>>>>> If vendor updates firmware with new fields I get those immediately. >>>>> >>>>> Conversely it's a valid concern that those who *do* actually update >>>>> their kernel regularly will have more things to worry about. >>>>> >>>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf >>>>>>> programs might have to take a lot of other state into consideration >>>>>>> when parsing the descriptors; all those details do seem like they >>>>>>> belong to the driver code. >>>>>> >>>>>> I would prefer to avoid being stuck on requiring driver writers to >>>>>> be involved. 
With just a kptr I can support the device and any >>>>>> firwmare versions without requiring help. >>>>> >>>>> 1) where are you getting all those HW / FW specs :S >>>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S >>>>> >>>>>>> Feel free to send it early with just a handful of drivers implemented; >>>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some >>>>>>> nice sample/test case that shows how the metadata can be used, that >>>>>>> might push us closer to the agreement on the best way to proceed. >>>>>> >>>>>> I'll try to do a intel and mlx implementation to get a cross section. >>>>>> I have a good collection of nics here so should be able to show a >>>>>> couple firmware versions. It could be fine I think to have the raw >>>>>> kptr access and then also kfuncs for some things perhaps. >>>>>> >>>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor >>>>>>>> parsing to user space will indeed result in what you just said - major >>>>>>>> vendors are supported and that's it. >>>>>> >>>>>> I'm not sure about why it would make it harder for new vendors? I think >>>>>> the opposite, >>>>> >>>>> TBH I'm only replying to the email because of the above part :) >>>>> I thought this would be self evident, but I guess our perspectives >>>>> are different. >>>>> >>>>> Perhaps you look at it from the perspective of SW running on someone >>>>> else's cloud, an being able to move to another cloud, without having >>>>> to worry if feature X is available in xdp or just skb. >>>>> >>>>> I look at it from the perspective of maintaining a cloud, with people >>>>> writing random XDP applications. If I swap a NIC from an incumbent to a >>>>> (superior) startup, and cloud users are messing with raw descriptor - >>>>> I'd need to go find every XDP program out there and make sure it >>>>> understands the new descriptors. >>>> >>>> Here is another perspective: >>>> >>>> As AF_XDP application developer I don't wan't to deal with the >>>> underlying hardware in detail. I like to request a feature from the OS >>>> (in this case rx/tx timestamping). If the feature is available I will >>>> simply use it, if not I might have to work around it - maybe by falling >>>> back to SW timestamping. >>>> >>>> All parts of my application (BPF program included) should not be >>>> optimized/adjusted for all the different HW variants out there. >>> >>> Yes, absolutely agreed. Abstracting away those kinds of hardware >>> differences is the whole *point* of having an OS/driver model. I.e., >>> it's what the kernel is there for! If people want to bypass that and get >>> direct access to the hardware, they can already do that by using DPDK. >>> >>> So in other words, 100% agreed that we should not expect the BPF >>> developers to deal with hardware details as would be required with a >>> kptr-based interface. >>> >>> As for the kfunc-based interface, I think it shows some promise. >>> Exposing a list of function names to retrieve individual metadata items >>> instead of a struct layout is sorta comparable in terms of developer UI >>> accessibility etc (IMO). >> >> Looks like there are quite some use cases for hw_timestamp. >> Do you think we could add it to the uapi like struct xdp_md? 
>> >> The following is the current xdp_md:
>>
>>     struct xdp_md {
>>         __u32 data;
>>         __u32 data_end;
>>         __u32 data_meta;
>>         /* Below access go through struct xdp_rxq_info */
>>         __u32 ingress_ifindex; /* rxq->dev->ifindex */
>>         __u32 rx_queue_index;  /* rxq->queue_index */
>>
>>         __u32 egress_ifindex;  /* txq->dev->ifindex */
>>     };
>>
>> We could add __u64 hw_timestamp to the xdp_md so user >> can just do xdp_md->hw_timestamp to get the value. >> xdp_md->hw_timestamp == 0 means hw_timestamp is not >> available. >> >> Inside the kernel, the ctx rewriter can generate code >> to call driver specific function to retrieve the data. > > If the driver generates the code to retrieve the data, how's that > different from the kfunc approach? > The only difference I see is that it would be a more strong UAPI than > the kfuncs?

Another thing may be worth considering: some hints for some HW/driver may be harder (or may not be worth it) to unroll/inline. For example, I see a driver doing spin_lock_bh while getting the hwtstamp. For this case, keeping the kfunc call and avoiding the unroll/inline may be the right thing to do.

> >> The kfunc approach can be used to *less* common use cases? > > What's the advantage of having two approaches when one can cover > common and uncommon cases? > >>> There are three main drawbacks, AFAICT: >>> >>> 1. It requires driver developers to write and maintain the code that >>> generates the unrolled BPF bytecode to access the metadata fields, which >>> is a non-trivial amount of complexity. Maybe this can be abstracted away >>> with some internal helpers though (like, e.g., a >>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out >>> the required JMP/MOV/LDX instructions? >>> >>> 2. AF_XDP programs won't be able to access the metadata without using a >>> custom XDP program that calls the kfuncs and puts the data into the >>> metadata area. We could solve this with some code in libxdp, though; if >>> this code can be made generic enough (so it just dumps the available >>> metadata functions from the running kernel at load time), it may be >>> possible to make it generic enough that it will be forward-compatible >>> with new versions of the kernel that add new fields, which should >>> alleviate Florian's concern about keeping things in sync. >>> >>> 3. It will make it harder to consume the metadata when building SKBs. I >>> think the CPUMAP and veth use cases are also quite important, and that >>> we want metadata to be available for building SKBs in this path. Maybe >>> this can be resolved by having a convenient kfunc for this that can be >>> used for programs doing such redirects. E.g., you could just call >>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that >>> would recursively expand into all the kfunc calls needed to extract the >>> metadata supported by the SKB path? >>> >>> -Toke >>>
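Toke's drawback (1) could be softened exactly as he suggests in the quote above. A rough sketch of such an instruction-emitting helper, using the BPF insn macros from <linux/filter.h>; the helper name comes from Toke's mail, and the surrounding driver-side unroll hook is an assumption:

#include <linux/filter.h>

/* Emit "dst_reg = *(u64 *)(src_reg + offset)" into the program image.
 * Returns the number of instructions written.
 */
static int bpf_xdp_metadata_copy_u64(struct bpf_insn *insn,
				     u8 dst_reg, u8 src_reg, s16 offset)
{
	*insn = BPF_LDX_MEM(BPF_DW, dst_reg, src_reg, offset);
	return 1;
}

/* A driver's kfunc-unroll hook could then inline a timestamp fetch as e.g.
 *
 *	cnt += bpf_xdp_metadata_copy_u64(&insn[cnt], BPF_REG_0, BPF_REG_1,
 *					 offsetof(struct mydrv_rx_desc, ts));
 *
 * instead of open-coding raw instructions (mydrv_rx_desc is illustrative).
 */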
On Tue, Nov 1, 2022 at 10:31 AM Martin KaFai Lau <martin.lau@linux.dev> wrote: > > On 10/31/22 3:09 PM, Stanislav Fomichev wrote: > > On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote: > >> > >> > >> > >> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote: > >>> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes: > >>> > >>>> Hi all, > >>>> > >>>> I was closely following this discussion for some time now. Seems we > >>>> reached the point where it's getting interesting for me. > >>>> > >>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote: > >>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote: > >>>>>>>> And it's actually harder to abstract away inter HW generation > >>>>>>>> differences if the user space code has to handle all of it. > >>>>>> > >>>>>> I don't see how its any harder in practice though? > >>>>> > >>>>> You need to find out what HW/FW/config you're running, right? > >>>>> And all you have is a pointer to a blob of unknown type. > >>>>> > >>>>> Take timestamps for example, some NICs support adjusting the PHC > >>>>> or doing SW corrections (with different versions of hw/fw/server > >>>>> platforms being capable of both/one/neither). > >>>>> > >>>>> Sure you can extract all this info with tracing and careful > >>>>> inspection via uAPI. But I don't think that's _easier_. > >>>>> And the vendors can't run the results thru their validation > >>>>> (for whatever that's worth). > >>>>> > >>>>>>> I've had the same concern: > >>>>>>> > >>>>>>> Until we have some userspace library that abstracts all these details, > >>>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob > >>>>>>> of data and I need to go through the code and see what particular type > >>>>>>> it represents for my particular device and how the data I need is > >>>>>>> represented there. There are also these "if this is device v1 -> use > >>>>>>> v1 descriptor format; if it's a v2->use this another struct; etc" > >>>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put > >>>>>>> this burden on the driver developers, but I agree that the drawback > >>>>>>> here is that we actually have to wait for the implementations to catch > >>>>>>> up. > >>>>>> > >>>>>> I agree with everything there, you will get a blob of data and then > >>>>>> will need to know what field you want to read using BTF. But, we > >>>>>> already do this for BPF programs all over the place so its not a big > >>>>>> lift for us. All other BPF tracing/observability requires the same > >>>>>> logic. I think users of BPF in general perhaps XDP/tc are the only > >>>>>> place left to write BPF programs without thinking about BTF and > >>>>>> kernel data structures. > >>>>>> > >>>>>> But, with proposed kptr the complexity lives in userspace and can be > >>>>>> fixed, added, updated without having to bother with kernel updates, etc. > >>>>>> From my point of view of supporting Cilium its a win and much preferred > >>>>>> to having to deal with driver owners on all cloud vendors, distributions, > >>>>>> and so on. > >>>>>> > >>>>>> If vendor updates firmware with new fields I get those immediately. > >>>>> > >>>>> Conversely it's a valid concern that those who *do* actually update > >>>>> their kernel regularly will have more things to worry about. 
> >>>>> > >>>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf > >>>>>>> programs might have to take a lot of other state into consideration > >>>>>>> when parsing the descriptors; all those details do seem like they > >>>>>>> belong to the driver code. > >>>>>> > >>>>>> I would prefer to avoid being stuck on requiring driver writers to > >>>>>> be involved. With just a kptr I can support the device and any > >>>>>> firwmare versions without requiring help. > >>>>> > >>>>> 1) where are you getting all those HW / FW specs :S > >>>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S > >>>>> > >>>>>>> Feel free to send it early with just a handful of drivers implemented; > >>>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some > >>>>>>> nice sample/test case that shows how the metadata can be used, that > >>>>>>> might push us closer to the agreement on the best way to proceed. > >>>>>> > >>>>>> I'll try to do a intel and mlx implementation to get a cross section. > >>>>>> I have a good collection of nics here so should be able to show a > >>>>>> couple firmware versions. It could be fine I think to have the raw > >>>>>> kptr access and then also kfuncs for some things perhaps. > >>>>>> > >>>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor > >>>>>>>> parsing to user space will indeed result in what you just said - major > >>>>>>>> vendors are supported and that's it. > >>>>>> > >>>>>> I'm not sure about why it would make it harder for new vendors? I think > >>>>>> the opposite, > >>>>> > >>>>> TBH I'm only replying to the email because of the above part :) > >>>>> I thought this would be self evident, but I guess our perspectives > >>>>> are different. > >>>>> > >>>>> Perhaps you look at it from the perspective of SW running on someone > >>>>> else's cloud, an being able to move to another cloud, without having > >>>>> to worry if feature X is available in xdp or just skb. > >>>>> > >>>>> I look at it from the perspective of maintaining a cloud, with people > >>>>> writing random XDP applications. If I swap a NIC from an incumbent to a > >>>>> (superior) startup, and cloud users are messing with raw descriptor - > >>>>> I'd need to go find every XDP program out there and make sure it > >>>>> understands the new descriptors. > >>>> > >>>> Here is another perspective: > >>>> > >>>> As AF_XDP application developer I don't wan't to deal with the > >>>> underlying hardware in detail. I like to request a feature from the OS > >>>> (in this case rx/tx timestamping). If the feature is available I will > >>>> simply use it, if not I might have to work around it - maybe by falling > >>>> back to SW timestamping. > >>>> > >>>> All parts of my application (BPF program included) should not be > >>>> optimized/adjusted for all the different HW variants out there. > >>> > >>> Yes, absolutely agreed. Abstracting away those kinds of hardware > >>> differences is the whole *point* of having an OS/driver model. I.e., > >>> it's what the kernel is there for! If people want to bypass that and get > >>> direct access to the hardware, they can already do that by using DPDK. > >>> > >>> So in other words, 100% agreed that we should not expect the BPF > >>> developers to deal with hardware details as would be required with a > >>> kptr-based interface. > >>> > >>> As for the kfunc-based interface, I think it shows some promise. 
> >>> Exposing a list of function names to retrieve individual metadata items > >>> instead of a struct layout is sorta comparable in terms of developer UI > >>> accessibility etc (IMO). > >> > >> Looks like there are quite some use cases for hw_timestamp. > >> Do you think we could add it to the uapi like struct xdp_md? > >> > >> The following is the current xdp_md: > >> struct xdp_md { > >> __u32 data; > >> __u32 data_end; > >> __u32 data_meta; > >> /* Below access go through struct xdp_rxq_info */ > >> __u32 ingress_ifindex; /* rxq->dev->ifindex */ > >> __u32 rx_queue_index; /* rxq->queue_index */ > >> > >> __u32 egress_ifindex; /* txq->dev->ifindex */ > >> }; > >> > >> We could add __u64 hw_timestamp to the xdp_md so user > >> can just do xdp_md->hw_timestamp to get the value. > >> xdp_md->hw_timestamp == 0 means hw_timestamp is not > >> available. > >> > >> Inside the kernel, the ctx rewriter can generate code > >> to call driver specific function to retrieve the data. > > > > If the driver generates the code to retrieve the data, how's that > > different from the kfunc approach? > > The only difference I see is that it would be a more strong UAPI than > > the kfuncs? > > Another thing may be worth considering, some hints for some HW/driver may be > harder (or may not worth) to unroll/inline. For example, I see driver is doing > spin_lock_bh while getting the hwtstamp. For this case, keep calling a kfunc > and avoid the unroll/inline may be the right thing to do. Yeah, I'm trying to look at the drivers right now and doing spinlocks/seqlocks might complicate the story... But it seems like in the worst case, the unrolled bytecode can always resort to calling a kernel function? (we might need to have some scratch area to preserve r1-r5 and we can't touch r6-r9 because we are not in a real call, but seems doable; I'll try to illustrate with a bunch of examples) > >> The kfunc approach can be used to *less* common use cases? > > > > What's the advantage of having two approaches when one can cover > > common and uncommon cases? > > > >>> There are three main drawbacks, AFAICT: > >>> > >>> 1. It requires driver developers to write and maintain the code that > >>> generates the unrolled BPF bytecode to access the metadata fields, which > >>> is a non-trivial amount of complexity. Maybe this can be abstracted away > >>> with some internal helpers though (like, e.g., a > >>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out > >>> the required JMP/MOV/LDX instructions? > >>> > >>> 2. AF_XDP programs won't be able to access the metadata without using a > >>> custom XDP program that calls the kfuncs and puts the data into the > >>> metadata area. We could solve this with some code in libxdp, though; if > >>> this code can be made generic enough (so it just dumps the available > >>> metadata functions from the running kernel at load time), it may be > >>> possible to make it generic enough that it will be forward-compatible > >>> with new versions of the kernel that add new fields, which should > >>> alleviate Florian's concern about keeping things in sync. > >>> > >>> 3. It will make it harder to consume the metadata when building SKBs. I > >>> think the CPUMAP and veth use cases are also quite important, and that > >>> we want metadata to be available for building SKBs in this path. Maybe > >>> this can be resolved by having a convenient kfunc for this that can be > >>> used for programs doing such redirects. 
E.g., you could just call > >>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that > >>> would recursively expand into all the kfunc calls needed to extract the > >>> metadata supported by the SKB path? > >>> > >>> -Toke > >>> >
On Tue, Nov 1, 2022 at 10:05 AM Martin KaFai Lau <martin.lau@linux.dev> wrote: > > On 10/31/22 6:59 PM, Stanislav Fomichev wrote: > > On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote: > >> > >> On 10/31/22 10:00 AM, Stanislav Fomichev wrote: > >>>> 2. AF_XDP programs won't be able to access the metadata without using a > >>>> custom XDP program that calls the kfuncs and puts the data into the > >>>> metadata area. We could solve this with some code in libxdp, though; if > >>>> this code can be made generic enough (so it just dumps the available > >>>> metadata functions from the running kernel at load time), it may be > >>>> possible to make it generic enough that it will be forward-compatible > >>>> with new versions of the kernel that add new fields, which should > >>>> alleviate Florian's concern about keeping things in sync. > >>> > >>> Good point. I had to convert to a custom program to use the kfuncs :-( > >>> But your suggestion sounds good; maybe libxdp can accept some extra > >>> info about at which offset the user would like to place the metadata > >>> and the library can generate the required bytecode? > >>> > >>>> 3. It will make it harder to consume the metadata when building SKBs. I > >>>> think the CPUMAP and veth use cases are also quite important, and that > >>>> we want metadata to be available for building SKBs in this path. Maybe > >>>> this can be resolved by having a convenient kfunc for this that can be > >>>> used for programs doing such redirects. E.g., you could just call > >>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that > >>>> would recursively expand into all the kfunc calls needed to extract the > >>>> metadata supported by the SKB path? > >>> > >>> So this xdp_copy_metadata_for_skb will create a metadata layout that > >> > >> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself? > >> Not sure where is the best point to specify this prog though. Somehow during > >> bpf_xdp_redirect_map? > >> or this prog belongs to the target cpumap and the xdp prog redirecting to this > >> cpumap has to write the meta layout in a way that the cpumap is expecting? > > > > We're probably interested in triggering it from the places where xdp > > frames can eventually be converted into skbs? > > So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW, > > anything that's not XDP_DROP / AF_XDP redirect). > > We can probably make it magically work, and can generate > > kernel-digestible metadata whenever data == data_meta, but the > > question - should we? > > (need to make sure we won't regress any existing cases that are not > > relying on the metadata) > > Instead of having some kernel-digestible meta data, how about calling another > bpf prog to initialize the skb fields from the meta area after > __xdp_build_skb_from_frame() in the cpumap, so > run_xdp_set_skb_fileds_from_metadata() may be a better name. > > The xdp_prog@rx sets the meta data and then redirect. If the xdp_prog@rx can > also specify a xdp prog to initialize the skb fields from the meta area, then > there is no need to have a kfunc to enforce a kernel-digestible layout. Not > sure what is a good way to specify this xdp_prog though... Not sure also about whether doing it at this point is too late or not? Need to take a closer look at all __xdp_build_skb_from_frame call sites... Also see Toke/Dave discussing potentially having helpers to override some of that metadata. In this case, having more control on the user side makes sense.. 
I'll probably start with an explicit helper for now, just to see if the overall approach is workable. Maybe we can have a follow up discussion about doing it more transparently. > >>> the kernel will be able to understand when converting back to skb? > >>> IIUC, the xdp program will look something like the following: > >>> > >>> if (xdp packet is to be consumed by af_xdp) { > >>> // do a bunch of bpf_xdp_metadata_<metadata> calls and assemble your > >>> own metadata layout > >>> return bpf_redirect_map(xsk, ...); > >>> } else { > >>> // if the packet is to be consumed by the kernel > >>> xdp_copy_metadata_for_skb(ctx); > >>> return bpf_redirect(...); > >>> } > >>> > >>> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe > >>> put some magic number in the first byte(s) of the metadata so the > >>> kernel can check whether xdp_copy_metadata_for_skb has been called > >>> previously (or maybe xdp_frame can carry this extra signal, idk). >
On 11/1/22 1:12 PM, Stanislav Fomichev wrote: >>>>> As for the kfunc-based interface, I think it shows some promise. >>>>> Exposing a list of function names to retrieve individual metadata items >>>>> instead of a struct layout is sorta comparable in terms of developer UI >>>>> accessibility etc (IMO). >>>> >>>> Looks like there are quite some use cases for hw_timestamp. >>>> Do you think we could add it to the uapi like struct xdp_md? >>>> >>>> The following is the current xdp_md:
>>>>
>>>>     struct xdp_md {
>>>>         __u32 data;
>>>>         __u32 data_end;
>>>>         __u32 data_meta;
>>>>         /* Below access go through struct xdp_rxq_info */
>>>>         __u32 ingress_ifindex; /* rxq->dev->ifindex */
>>>>         __u32 rx_queue_index;  /* rxq->queue_index */
>>>>
>>>>         __u32 egress_ifindex;  /* txq->dev->ifindex */
>>>>     };
>>>>
>>>> We could add __u64 hw_timestamp to the xdp_md so user >>>> can just do xdp_md->hw_timestamp to get the value. >>>> xdp_md->hw_timestamp == 0 means hw_timestamp is not >>>> available. >>>> >>>> Inside the kernel, the ctx rewriter can generate code >>>> to call driver specific function to retrieve the data. >>> >>> If the driver generates the code to retrieve the data, how's that >>> different from the kfunc approach? >>> The only difference I see is that it would be a more strong UAPI than >>> the kfuncs? >> >> Another thing may be worth considering: some hints for some HW/driver may be >> harder (or may not be worth it) to unroll/inline. For example, I see a driver doing >> spin_lock_bh while getting the hwtstamp. For this case, keeping the kfunc call >> and avoiding the unroll/inline may be the right thing to do. > > Yeah, I'm trying to look at the drivers right now and doing > spinlocks/seqlocks might complicate the story... > But it seems like in the worst case, the unrolled bytecode can always > resort to calling a kernel function?

Unroll the common cases and call a kernel function for everything else? That should be doable. The bpf prog calling it as a kfunc will have more flexibility this way.

> (we might need to have some scratch area to preserve r1-r5 and we > can't touch r6-r9 because we are not in a real call, but seems doable; > I'll try to illustrate with a bunch of examples) > > >>>> The kfunc approach can be used to *less* common use cases? >>> >>> What's the advantage of having two approaches when one can cover >>> common and uncommon cases? >>> >>>>> There are three main drawbacks, AFAICT: >>>>> >>>>> 1. It requires driver developers to write and maintain the code that >>>>> generates the unrolled BPF bytecode to access the metadata fields, which >>>>> is a non-trivial amount of complexity. Maybe this can be abstracted away >>>>> with some internal helpers though (like, e.g., a >>>>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out >>>>> the required JMP/MOV/LDX instructions? >>>>> >>>>> 2. AF_XDP programs won't be able to access the metadata without using a >>>>> custom XDP program that calls the kfuncs and puts the data into the >>>>> metadata area. We could solve this with some code in libxdp, though; if >>>>> this code can be made generic enough (so it just dumps the available >>>>> metadata functions from the running kernel at load time), it may be >>>>> possible to make it generic enough that it will be forward-compatible >>>>> with new versions of the kernel that add new fields, which should >>>>> alleviate Florian's concern about keeping things in sync. >>>>> >>>>> 3. It will make it harder to consume the metadata when building SKBs.
I >>>>> think the CPUMAP and veth use cases are also quite important, and that >>>>> we want metadata to be available for building SKBs in this path. Maybe >>>>> this can be resolved by having a convenient kfunc for this that can be >>>>> used for programs doing such redirects. E.g., you could just call >>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that >>>>> would recursively expand into all the kfunc calls needed to extract the >>>>> metadata supported by the SKB path? >>>>> >>>>> -Toke >>>>> >>
On 01/11/2022 18.05, Martin KaFai Lau wrote: > On 10/31/22 6:59 PM, Stanislav Fomichev wrote: >> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau >> <martin.lau@linux.dev> wrote: >>> >>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote: >>>>> 2. AF_XDP programs won't be able to access the metadata without >>>>> using a >>>>> custom XDP program that calls the kfuncs and puts the data into the >>>>> metadata area. We could solve this with some code in libxdp, >>>>> though; if >>>>> this code can be made generic enough (so it just dumps the available >>>>> metadata functions from the running kernel at load time), it may be >>>>> possible to make it generic enough that it will be forward-compatible >>>>> with new versions of the kernel that add new fields, which should >>>>> alleviate Florian's concern about keeping things in sync. >>>> >>>> Good point. I had to convert to a custom program to use the kfuncs :-( >>>> But your suggestion sounds good; maybe libxdp can accept some extra >>>> info about at which offset the user would like to place the metadata >>>> and the library can generate the required bytecode? >>>> >>>>> 3. It will make it harder to consume the metadata when building >>>>> SKBs. I >>>>> think the CPUMAP and veth use cases are also quite important, and that >>>>> we want metadata to be available for building SKBs in this path. Maybe >>>>> this can be resolved by having a convenient kfunc for this that can be >>>>> used for programs doing such redirects. E.g., you could just call >>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that >>>>> would recursively expand into all the kfunc calls needed to extract >>>>> the >>>>> metadata supported by the SKB path? >>>> >>>> So this xdp_copy_metadata_for_skb will create a metadata layout that >>> >>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself? >>> Not sure where is the best point to specify this prog though. >>> Somehow during >>> bpf_xdp_redirect_map? >>> or this prog belongs to the target cpumap and the xdp prog >>> redirecting to this >>> cpumap has to write the meta layout in a way that the cpumap is >>> expecting? >> >> We're probably interested in triggering it from the places where xdp >> frames can eventually be converted into skbs? >> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW, >> anything that's not XDP_DROP / AF_XDP redirect). >> We can probably make it magically work, and can generate >> kernel-digestible metadata whenever data == data_meta, but the >> question - should we? >> (need to make sure we won't regress any existing cases that are not >> relying on the metadata) > > Instead of having some kernel-digestible meta data, how about calling > another bpf prog to initialize the skb fields from the meta area after > __xdp_build_skb_from_frame() in the cpumap, so > run_xdp_set_skb_fields_from_metadata() may be a better name. >

I very much like this idea of calling another bpf prog to initialize the SKB fields from the meta area. (As a reminder, the data needs to come from the meta area, because at this point the hardware RX-desc is out-of-scope.) I'm onboard with xdp_copy_metadata_for_skb() populating the meta area.

We could invoke this BPF-prog inside __xdp_build_skb_from_frame().

We might need a new BPF_PROG_TYPE_XDP2SKB, as this new BPF-prog run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as context inputs. Right? (Not sure if this is acceptable with the BPF maintainers' new rules.)

> The xdp_prog@rx sets the meta data and then redirects.
If the > xdp_prog@rx can also specify an xdp prog to initialize the skb fields > from the meta area, then there is no need to have a kfunc to enforce a > kernel-digestible layout. Not sure what is a good way to specify this > xdp_prog though...

The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside __xdp_build_skb_from_frame() is that it needs to know how to decode the meta area for every device driver or XDP-prog populating this (as veth and cpumap can get redirected packets from multiple device drivers). Sure, using a common function/helper/macro like xdp_copy_metadata_for_skb() could help reduce this multiplexing, but we want to have maximum flexibility to extend this without having to update the kernel, right?

Fortunately __xdp_build_skb_from_frame() has a net_device parameter that points to the device the frame was received on (xdp_frame->dev_rx). Thus, we could extend net_device and add this BPF-prog on a per-net_device basis. This could function as a driver BPF-prog callback that populates the SKB fields from the XDP meta data. Is this a good or bad idea? (A rough sketch follows at the end of this message.)

>>>> the kernel will be able to understand when converting back to skb? >>>> IIUC, the xdp program will look something like the following:
>>>>
>>>>     if (xdp packet is to be consumed by af_xdp) {
>>>>         // do a bunch of bpf_xdp_metadata_<metadata> calls and
>>>>         // assemble your own metadata layout
>>>>         return bpf_redirect_map(xsk, ...);
>>>>     } else {
>>>>         // if the packet is to be consumed by the kernel
>>>>         xdp_copy_metadata_for_skb(ctx);
>>>>         return bpf_redirect(...);
>>>>     }
>>>>
>>>> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe >>>> put some magic number in the first byte(s) of the metadata so the >>>> kernel can check whether xdp_copy_metadata_for_skb has been called >>>> previously (or maybe xdp_frame can carry this extra signal, idk).

I'm in favor of adding a flag bit to xdp_frame to signal this. In __xdp_build_skb_from_frame() we could check this flag and then invoke the BPF-prog type BPF_PROG_TYPE_XDP2SKB.

--Jesper
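A rough sketch of the per-net_device callback plus xdp_frame flag, as discussed above. Everything here is hypothetical: the xdp2skb_prog member on net_device, the XDP_FLAGS_META_FOR_SKB bit, and the context struct do not exist in the kernel:

struct xdp2skb_md {
	struct xdp_frame *frame;
	struct sk_buff *skb;
};

/* Called from __xdp_build_skb_from_frame(); caller holds rcu_read_lock() */
static void xdp2skb_run_fields_prog(struct xdp_frame *frame,
				    struct sk_buff *skb)
{
	struct bpf_prog *prog = rcu_dereference(frame->dev_rx->xdp2skb_prog);
	struct xdp2skb_md md = { .frame = frame, .skb = skb };

	if (!prog || !(frame->flags & XDP_FLAGS_META_FOR_SKB))
		return;
	/* The BPF_PROG_TYPE_XDP2SKB prog decodes the meta area and fills
	 * skb->hash, csum, header offsets, etc. for this specific device.
	 */
	bpf_prog_run(prog, &md);
}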
Jesper Dangaard Brouer <jbrouer@redhat.com> writes: > On 01/11/2022 18.05, Martin KaFai Lau wrote: >> On 10/31/22 6:59 PM, Stanislav Fomichev wrote: >>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau >>> <martin.lau@linux.dev> wrote: >>>> >>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote: >>>>>> 2. AF_XDP programs won't be able to access the metadata without >>>>>> using a >>>>>> custom XDP program that calls the kfuncs and puts the data into the >>>>>> metadata area. We could solve this with some code in libxdp, >>>>>> though; if >>>>>> this code can be made generic enough (so it just dumps the available >>>>>> metadata functions from the running kernel at load time), it may be >>>>>> possible to make it generic enough that it will be forward-compatible >>>>>> with new versions of the kernel that add new fields, which should >>>>>> alleviate Florian's concern about keeping things in sync. >>>>> >>>>> Good point. I had to convert to a custom program to use the kfuncs :-( >>>>> But your suggestion sounds good; maybe libxdp can accept some extra >>>>> info about at which offset the user would like to place the metadata >>>>> and the library can generate the required bytecode? >>>>> >>>>>> 3. It will make it harder to consume the metadata when building >>>>>> SKBs. I >>>>>> think the CPUMAP and veth use cases are also quite important, and that >>>>>> we want metadata to be available for building SKBs in this path. Maybe >>>>>> this can be resolved by having a convenient kfunc for this that can be >>>>>> used for programs doing such redirects. E.g., you could just call >>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that >>>>>> would recursively expand into all the kfunc calls needed to extract >>>>>> the >>>>>> metadata supported by the SKB path? >>>>> >>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that >>>> >>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself? >>>> Not sure where is the best point to specify this prog though. >>>> Somehow during >>>> bpf_xdp_redirect_map? >>>> or this prog belongs to the target cpumap and the xdp prog >>>> redirecting to this >>>> cpumap has to write the meta layout in a way that the cpumap is >>>> expecting? >>> >>> We're probably interested in triggering it from the places where xdp >>> frames can eventually be converted into skbs? >>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW, >>> anything that's not XDP_DROP / AF_XDP redirect). >>> We can probably make it magically work, and can generate >>> kernel-digestible metadata whenever data == data_meta, but the >>> question - should we? >>> (need to make sure we won't regress any existing cases that are not >>> relying on the metadata) >> >> Instead of having some kernel-digestible meta data, how about calling >> another bpf prog to initialize the skb fields from the meta area after >> __xdp_build_skb_from_frame() in the cpumap, so >> run_xdp_set_skb_fileds_from_metadata() may be a better name. >> > > I very much like this idea of calling another bpf prog to initialize the > SKB fields from the meta area. (As a reminder, data need to come from > meta area, because at this point the hardware RX-desc is out-of-scope). > I'm onboard with xdp_copy_metadata_for_skb() populating the meta area. > > We could invoke this BPF-prog inside __xdp_build_skb_from_frame(). > > We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog > run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as > context inputs. 
Right? (Not sure, if this is acceptable with the BPF > maintainers new rules) > >> The xdp_prog@rx sets the meta data and then redirect. If the >> xdp_prog@rx can also specify a xdp prog to initialize the skb fields >> from the meta area, then there is no need to have a kfunc to enforce a >> kernel-digestible layout. Not sure what is a good way to specify this >> xdp_prog though... > > The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside > __xdp_build_skb_from_frame() is that it need to know howto decode the > meta area for every device driver or XDP-prog populating this (as veth > and cpumap can get redirected packets from multiple device drivers). If we have the helper to copy the data "out of" the drivers, why do we need a second BPF program to copy data to the SKB? I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes each of the kfuncs needed for the metadata used by SKBs, all of which get unrolled. The helper takes the output of these metadata-extracting kfuncs and stores it "somewhere". This "somewhere" could well be the metadata area; but in any case, since it's hidden away inside a helper (or kfunc) from the calling XDP program's PoV, the helper can just stash all the data in a fixed format, which __xdp_build_skb_from_frame() can then just read statically. We could even make this format match the field layout of struct sk_buff, so all we have to do is memcpy a contiguous chunk of memory when building the SKB. > Sure, using a common function/helper/macro like > xdp_copy_metadata_for_skb() could help reduce this multiplexing, but > we want to have maximum flexibility to extend this without having to > update the kernel, right. The extension mechanism is in which kfuncs are available to XDP programs to extract metadata. The kernel then just becomes another consumer of those kfuncs, by way of the xdp_copy_metadata_for_skb(); but there could also be other kfuncs added that are not used for skbs (even vendor-specific ones if we want to allow that). -Toke
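A minimal sketch of what Toke's fixed-format idea could look like; the struct and function names are illustrative, not kernel API, and the field set is just the SKB-relevant subset discussed in the thread:

/* Field order deliberately mirrors the sk_buff members the SKB-build
 * path wants to fill, so consuming it is a handful of plain stores
 * (or a memcpy, if the layouts are kept bit-identical).
 */
struct xdp_skb_meta {
	__u32	hash;
	__u16	vlan_tci;
	__wsum	csum;
};

static void xdp_skb_meta_apply(struct sk_buff *skb,
			       const struct xdp_skb_meta *meta)
{
	skb->hash = meta->hash; /* what GRO/RPS would otherwise compute */
	skb->vlan_tci = meta->vlan_tci;
	skb->csum = meta->csum;
}

In real kernel code the stores would presumably go through helpers like skb_set_hash() and __vlan_hwaccel_put_tag(), so the companion bits (l4_hash, vlan_present) get set consistently as well.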
On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Jesper Dangaard Brouer <jbrouer@redhat.com> writes: > > > On 01/11/2022 18.05, Martin KaFai Lau wrote: > >> On 10/31/22 6:59 PM, Stanislav Fomichev wrote: > >>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau > >>> <martin.lau@linux.dev> wrote: > >>>> > >>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote: > >>>>>> 2. AF_XDP programs won't be able to access the metadata without > >>>>>> using a > >>>>>> custom XDP program that calls the kfuncs and puts the data into the > >>>>>> metadata area. We could solve this with some code in libxdp, > >>>>>> though; if > >>>>>> this code can be made generic enough (so it just dumps the available > >>>>>> metadata functions from the running kernel at load time), it may be > >>>>>> possible to make it generic enough that it will be forward-compatible > >>>>>> with new versions of the kernel that add new fields, which should > >>>>>> alleviate Florian's concern about keeping things in sync. > >>>>> > >>>>> Good point. I had to convert to a custom program to use the kfuncs :-( > >>>>> But your suggestion sounds good; maybe libxdp can accept some extra > >>>>> info about at which offset the user would like to place the metadata > >>>>> and the library can generate the required bytecode? > >>>>> > >>>>>> 3. It will make it harder to consume the metadata when building > >>>>>> SKBs. I > >>>>>> think the CPUMAP and veth use cases are also quite important, and that > >>>>>> we want metadata to be available for building SKBs in this path. Maybe > >>>>>> this can be resolved by having a convenient kfunc for this that can be > >>>>>> used for programs doing such redirects. E.g., you could just call > >>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that > >>>>>> would recursively expand into all the kfunc calls needed to extract > >>>>>> the > >>>>>> metadata supported by the SKB path? > >>>>> > >>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that > >>>> > >>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself? > >>>> Not sure where is the best point to specify this prog though. > >>>> Somehow during > >>>> bpf_xdp_redirect_map? > >>>> or this prog belongs to the target cpumap and the xdp prog > >>>> redirecting to this > >>>> cpumap has to write the meta layout in a way that the cpumap is > >>>> expecting? > >>> > >>> We're probably interested in triggering it from the places where xdp > >>> frames can eventually be converted into skbs? > >>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW, > >>> anything that's not XDP_DROP / AF_XDP redirect). > >>> We can probably make it magically work, and can generate > >>> kernel-digestible metadata whenever data == data_meta, but the > >>> question - should we? > >>> (need to make sure we won't regress any existing cases that are not > >>> relying on the metadata) > >> > >> Instead of having some kernel-digestible meta data, how about calling > >> another bpf prog to initialize the skb fields from the meta area after > >> __xdp_build_skb_from_frame() in the cpumap, so > >> run_xdp_set_skb_fileds_from_metadata() may be a better name. > >> > > > > I very much like this idea of calling another bpf prog to initialize the > > SKB fields from the meta area. (As a reminder, data need to come from > > meta area, because at this point the hardware RX-desc is out-of-scope). > > I'm onboard with xdp_copy_metadata_for_skb() populating the meta area. 
> > > > We could invoke this BPF-prog inside __xdp_build_skb_from_frame(). > > > > We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog > > run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as > > context inputs. Right? (Not sure, if this is acceptable with the BPF > > maintainers new rules) > > > >> The xdp_prog@rx sets the meta data and then redirects. If the > >> xdp_prog@rx can also specify an xdp prog to initialize the skb fields > >> from the meta area, then there is no need to have a kfunc to enforce a > >> kernel-digestible layout. Not sure what is a good way to specify this > >> xdp_prog though... > > > > The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside > > __xdp_build_skb_from_frame() is that it needs to know how to decode the > > meta area for every device driver or XDP-prog populating this (as veth > > and cpumap can get redirected packets from multiple device drivers). > > If we have the helper to copy the data "out of" the drivers, why do we > need a second BPF program to copy data to the SKB? > > I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes > each of the kfuncs needed for the metadata used by SKBs, all of which > get unrolled. The helper takes the output of these metadata-extracting > kfuncs and stores it "somewhere". This "somewhere" could well be the > metadata area; but in any case, since it's hidden away inside a helper > (or kfunc) from the calling XDP program's PoV, the helper can just stash > all the data in a fixed format, which __xdp_build_skb_from_frame() can > then just read statically. We could even make this format match the > field layout of struct sk_buff, so all we have to do is memcpy a > contiguous chunk of memory when building the SKB.

+1

I'm currently doing exactly what you're suggesting (minus matching skb layout):

    struct xdp_to_skb_metadata {
        u32 magic; // randomized at boot
        ... skb-consumable metadata in fixed format
    } __randomize_layout;

bpf_xdp_copy_metadata_for_skb() does bpf_xdp_adjust_meta(ctx, -sizeof(struct xdp_to_skb_metadata)) and then calls a bunch of kfuncs to fill in the actual data.

Then, at __xdp_build_skb_from_frame time, I have regular kernel C code that parses that 'struct xdp_to_skb_metadata'. (To be precise, I'm trying to parse the metadata from skb_metadata_set; it's called from __xdp_build_skb_from_frame, but not 100% sure that's the right place.) (I also randomize the layout and magic to make sure userspace doesn't depend on it, because nothing stops this packet from being routed into an xsk socket..)

> > Sure, using a common function/helper/macro like > > xdp_copy_metadata_for_skb() could help reduce this multiplexing, but > > we want to have maximum flexibility to extend this without having to > > update the kernel, right. > > The extension mechanism is in which kfuncs are available to XDP programs > to extract metadata. The kernel then just becomes another consumer of > those kfuncs, by way of the xdp_copy_metadata_for_skb(); but there could > also be other kfuncs added that are not used for skbs (even > vendor-specific ones if we want to allow that). > > -Toke >
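A sketch of the consuming side Stanislav describes, with the magic check up front. The placement (on the skb_metadata_set() path) follows his mail; the names, the boot-time initialization, and the field handling are assumptions:

static u32 xdp_to_skb_magic __ro_after_init; /* e.g. get_random_u32() at boot */

static void xdp_meta_maybe_set_skb_fields(struct sk_buff *skb,
					  const void *meta, u8 metasize)
{
	const struct xdp_to_skb_metadata *m = meta;

	/* Only trust the layout if the boot-randomized magic matches,
	 * i.e. the meta area was written by bpf_xdp_copy_metadata_for_skb().
	 */
	if (metasize < sizeof(*m) || m->magic != xdp_to_skb_magic)
		return;

	/* consume the fixed-format fields, e.g. skb_set_hash(...) */
}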
Stanislav Fomichev <sdf@google.com> writes: > On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Jesper Dangaard Brouer <jbrouer@redhat.com> writes: >> >> > On 01/11/2022 18.05, Martin KaFai Lau wrote: >> >> On 10/31/22 6:59 PM, Stanislav Fomichev wrote: >> >>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau >> >>> <martin.lau@linux.dev> wrote: >> >>>> >> >>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote: >> >>>>>> 2. AF_XDP programs won't be able to access the metadata without >> >>>>>> using a >> >>>>>> custom XDP program that calls the kfuncs and puts the data into the >> >>>>>> metadata area. We could solve this with some code in libxdp, >> >>>>>> though; if >> >>>>>> this code can be made generic enough (so it just dumps the available >> >>>>>> metadata functions from the running kernel at load time), it may be >> >>>>>> possible to make it generic enough that it will be forward-compatible >> >>>>>> with new versions of the kernel that add new fields, which should >> >>>>>> alleviate Florian's concern about keeping things in sync. >> >>>>> >> >>>>> Good point. I had to convert to a custom program to use the kfuncs :-( >> >>>>> But your suggestion sounds good; maybe libxdp can accept some extra >> >>>>> info about at which offset the user would like to place the metadata >> >>>>> and the library can generate the required bytecode? >> >>>>> >> >>>>>> 3. It will make it harder to consume the metadata when building >> >>>>>> SKBs. I >> >>>>>> think the CPUMAP and veth use cases are also quite important, and that >> >>>>>> we want metadata to be available for building SKBs in this path. Maybe >> >>>>>> this can be resolved by having a convenient kfunc for this that can be >> >>>>>> used for programs doing such redirects. E.g., you could just call >> >>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that >> >>>>>> would recursively expand into all the kfunc calls needed to extract >> >>>>>> the >> >>>>>> metadata supported by the SKB path? >> >>>>> >> >>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that >> >>>> >> >>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself? >> >>>> Not sure where is the best point to specify this prog though. >> >>>> Somehow during >> >>>> bpf_xdp_redirect_map? >> >>>> or this prog belongs to the target cpumap and the xdp prog >> >>>> redirecting to this >> >>>> cpumap has to write the meta layout in a way that the cpumap is >> >>>> expecting? >> >>> >> >>> We're probably interested in triggering it from the places where xdp >> >>> frames can eventually be converted into skbs? >> >>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW, >> >>> anything that's not XDP_DROP / AF_XDP redirect). >> >>> We can probably make it magically work, and can generate >> >>> kernel-digestible metadata whenever data == data_meta, but the >> >>> question - should we? >> >>> (need to make sure we won't regress any existing cases that are not >> >>> relying on the metadata) >> >> >> >> Instead of having some kernel-digestible meta data, how about calling >> >> another bpf prog to initialize the skb fields from the meta area after >> >> __xdp_build_skb_from_frame() in the cpumap, so >> >> run_xdp_set_skb_fileds_from_metadata() may be a better name. >> >> >> > >> > I very much like this idea of calling another bpf prog to initialize the >> > SKB fields from the meta area. 
(As a reminder, data need to come from >> > meta area, because at this point the hardware RX-desc is out-of-scope). >> > I'm onboard with xdp_copy_metadata_for_skb() populating the meta area. >> > >> > We could invoke this BPF-prog inside __xdp_build_skb_from_frame(). >> > >> > We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog >> > run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as >> > context inputs. Right? (Not sure, if this is acceptable with the BPF >> > maintainers new rules) >> > >> >> The xdp_prog@rx sets the meta data and then redirect. If the >> >> xdp_prog@rx can also specify a xdp prog to initialize the skb fields >> >> from the meta area, then there is no need to have a kfunc to enforce a >> >> kernel-digestible layout. Not sure what is a good way to specify this >> >> xdp_prog though... >> > >> > The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside >> > __xdp_build_skb_from_frame() is that it need to know howto decode the >> > meta area for every device driver or XDP-prog populating this (as veth >> > and cpumap can get redirected packets from multiple device drivers). >> >> If we have the helper to copy the data "out of" the drivers, why do we >> need a second BPF program to copy data to the SKB? >> >> I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes >> each of the kfuncs needed for the metadata used by SKBs, all of which >> get unrolled. The helper takes the output of these metadata-extracting >> kfuncs and stores it "somewhere". This "somewhere" could well be the >> metadata area; but in any case, since it's hidden away inside a helper >> (or kfunc) from the calling XDP program's PoV, the helper can just stash >> all the data in a fixed format, which __xdp_build_skb_from_frame() can >> then just read statically. We could even make this format match the >> field layout of struct sk_buff, so all we have to do is memcpy a >> contiguous chunk of memory when building the SKB. > > +1 > > I'm currently doing exactly what you're suggesting (minus matching skb layout): > > struct xdp_to_skb_metadata { > u32 magic; // randomized at boot > ... skb-consumable-metadata in fixed format > } __randomize_layout; > > bpf_xdp_copy_metadata_for_skb() does bpf_xdp_adjust_meta(ctx, > -sizeof(struct xdp_to_skb_metadata)) and then calls a bunch of kfuncs > to fill in the actual data. > > Then, at __xdp_build_skb_from_frame time, I'm having a regular kernel > C code that parses that 'struct xdp_to_skb_metadata'. > (To be precise, I'm trying to parse the metadata from > skb_metadata_set; it's called from __xdp_build_skb_from_frame, but not > 100% sure that's the right place). > (I also randomize the layout and magic to make sure userspace doesn't > depend on it because nothing stops this packet to be routed into xsk > socket..) Ah, nice trick with __randomize_layout - I agree we need to do something to prevent userspace from inadvertently starting to rely on this, and this seems like a great solution! Look forward to seeing what the whole thing looks like in a more complete form :) -Toke
On 03/11/2022 01.09, Toke Høiland-Jørgensen wrote: > Stanislav Fomichev <sdf@google.com> writes: >> On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: [...]
>>> If we have the helper to copy the data "out of" the drivers, why do we >>> need a second BPF program to copy data to the SKB? IMHO the second BPF program to populate the SKB is needed to add flexibility and extensibility. My end-goal here is to speed up packet parsing. This BPF-prog should (in time) be able to update skb->transport_header and skb->network_header. As I mentioned before, the HW RX-hash already tells us the L3 and L4 protocols and, in most cases, the header length. Even without HW-hints, the XDP-prog has likely parsed the packet once. This parse information is lost today and redone by the netstack. What about storing this header parse info in the meta area and re-using it in this new XDP2SKB hook? The reason for suggesting this BPF-prog to be a callback, associated with the net_device, was that hardware differs in which HW hints it supports. Thus, we can avoid a generic C-function that needs to check for all the possible hints, and instead have a BPF-prog that only contains the code relevant for this net_device. >>> I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes >>> each of the kfuncs needed for the metadata used by SKBs, all of which >>> get unrolled. [...] We could even make this format match the field >>> layout of struct sk_buff, so all we have to do is memcpy a contiguous >>> chunk of memory when building the SKB. >> >> +1 Sorry, I think this "hiding" layout trick is going in a wrong direction. Imagine the use-case of cpumap redirect. The physical device XDP-prog calls xdp_copy_metadata_for_skb() to extract info from the RX-desc, then it calls redirect into the cpumap. On the remote CPU, the xdp_frame is picked up, and then I want to run another XDP-prog that wants to look at these HW-hints, and then likely call XDP_PASS, which creates the SKB, also using these HW-hints. I take it that would not be possible when using the xdp_copy_metadata_for_skb() helper?
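Jesper's header-parse idea above might look roughly like this from the XDP side, assuming a later XDP2SKB-style hook copies the stashed offsets into skb->network_header and skb->transport_header. The struct and field names here are hypothetical:

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>

    /* Hypothetical layout for parse results, stashed in the meta area. */
    struct xdp_parse_hints {
            __u16 network_off;      /* offset of the L3 header */
            __u16 transport_off;    /* offset of the L4 header */
    };

    SEC("xdp")
    int stash_parse_info(struct xdp_md *ctx)
    {
            struct xdp_parse_hints *hints;
            void *data;

            if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*hints)))
                    return XDP_PASS;

            data = (void *)(long)ctx->data;
            hints = (void *)(long)ctx->data_meta;
            if ((void *)(hints + 1) > data)
                    return XDP_PASS;

            /* A real program would derive these from its own parsing
             * (or from the HW RX-hash type); fixed Ethernet+IPv4
             * offsets serve as a stand-in here. */
            hints->network_off = sizeof(struct ethhdr);
            hints->transport_off = sizeof(struct ethhdr) + 20;

            return XDP_PASS;
    }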
>> I'm currently doing exactly what you're suggesting (minus matching skb layout): [...] > > Ah, nice trick with __randomize_layout - I agree we need to do something > to prevent userspace from inadvertently starting to rely on this, and > this seems like a great solution! Sorry, I disagree with where this is going. Why do we all of a sudden want to prevent userspace (e.g. AF_XDP) from using this data?!? The whole exercise started with wanting to provide AF_XDP with these HW-hints. The hints a standard AF_XDP user wants are likely very similar to what the SKB user wants. Why does the AF_XDP user need to open-code this? The BTF-ID scheme precisely allows us to expose this layout to userspace, and at the same time have the freedom to change it in kernel space, as userspace must decode the BTF-layout before reading this. I was hoping xdp_copy_metadata_for_skb() could simply use the BTF-ID scheme, with the BTF-ID of struct xdp_hints_rx_common - is that too much to ask for? You can just consider the BTF-ID as the magic number, as it will be more-or-less random per kernel build (and module load order). > Look forward to seeing what the whole thing looks like in a more > complete form :) I'm sort of on-board with the kfuncs and unroll-tricks, if I can see some driver code that handles the issue of exposing the HW setup state needed to decode the RX-desc format. I sense that I haven't been good enough at explaining/conveying the BTF-ID scheme. Next week, I will code some examples that demo how BTF-IDs can be used from BPF-progs, even as a communication channel between different BPF-progs (e.g. drv XDP-prog -> cpumap XDP-prog -> TC-BPF). --Jesper
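The "userspace must decode the BTF-layout before reading" step Jesper refers to could look like this with libbpf's BTF API (a sketch with error handling elided; xdp_hints_rx_common is the struct name from Jesper's series):

    #include <stdio.h>
    #include <bpf/btf.h>

    /* Resolve the kernel BTF ID and member offsets of the hints struct
     * at runtime instead of hardcoding a (possibly randomized) layout. */
    static int resolve_hints_layout(void)
    {
            struct btf *vmlinux_btf = btf__load_vmlinux_btf();
            int id = btf__find_by_name_kind(vmlinux_btf, "xdp_hints_rx_common",
                                            BTF_KIND_STRUCT);
            const struct btf_type *t = btf__type_by_id(vmlinux_btf, id);
            const struct btf_member *m = btf_members(t);
            int i;

            for (i = 0; i < btf_vlen(t); i++, m++)
                    printf("member %s at bit offset %u\n",
                           btf__name_by_offset(vmlinux_btf, m->name_off),
                           m->offset);

            btf__free(vmlinux_btf);
            return id;      /* the "magic" value to match against */
    }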
Jesper Dangaard Brouer <jbrouer@redhat.com> writes: > On 03/11/2022 01.09, Toke Høiland-Jørgensen wrote: >> Stanislav Fomichev <sdf@google.com> writes: [...]
> IMHO the second BPF program to populate the SKB is needed to add > flexibility and extensibility. > [...] > The reason for suggesting this BPF-prog to be a callback, associated > with the net_device, was that hardware differs in which HW hints it > supports. Thus, we can avoid a generic C-function that needs to check > for all the possible hints, and instead have a BPF-prog that only > contains the code relevant for this net_device. But that's exactly what the xdp_copy_metadata_for_skb() is! It's a dynamic "BPF program" (generated as unrolled kfunc calls) just running in the helper and stashing the results in an intermediate struct in the metadata area. And once it's done that, we don't need *another* dynamic BPF program to read it back out and populate the SKB, because the intermediate format it's been stashed into is under the control of the kernel (we just need a flag to indicate that it's there).
> Sorry, I think this "hiding" layout trick is going in a wrong direction. > Imagine the use-case of cpumap redirect. The physical device XDP-prog calls > xdp_copy_metadata_for_skb() to extract info from the RX-desc, then it calls > redirect into the cpumap. On the remote CPU, the xdp_frame is picked up, and > then I want to run another XDP-prog that wants to look at these HW-hints, > and then likely call XDP_PASS, which creates the SKB, also using these > HW-hints. I take it that would not be possible when using the > xdp_copy_metadata_for_skb() helper? You're right that it should be possible to read the values back out again later. That is totally possible with this scheme, though; the 'xdp_to_skb_metadata' is going to be in the vmlinux BTF, so an XDP program can just read that. We can explicitly support it by using the BTF ID as the "magic value" as you suggest, which would be fine by me. I still think we should be using the __randomize_layout trick, though, precisely so that BPF consumers are forced to use BTF relocations to read it; otherwise we risk the struct layout ossifying into UAPI because people are just going to assume it's static... > Sorry, I disagree with where this is going. Why do we all of a sudden want > to prevent userspace (e.g. AF_XDP) from using this data?!? See above: I don't think we should prevent userspace from using it (and we're not), but we should prevent the struct layout from ossifying. > The whole exercise started with wanting to provide AF_XDP with these > HW-hints. [...] You can just consider the BTF-ID as the magic number, as it > will be more-or-less random per kernel build (and module load order). As mentioned above, I would be totally fine with just having the xdp_to_skb_metadata be part of BTF, enabling both XDP programs and AF_XDP consumers to re-use it.
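Reading the stashed struct back out of an XDP program on the cpumap side, as Toke describes, might look like the following sketch. It assumes struct xdp_to_skb_metadata (with a magic member and, hypothetically, an rx_timestamp member) is visible via vmlinux.h, which also makes the plain field accesses CO-RE-relocatable, and it uses the BTF ID as the magic value per the discussion:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    SEC("xdp/cpumap")
    int read_hints(struct xdp_md *ctx)
    {
            void *data = (void *)(long)ctx->data;
            struct xdp_to_skb_metadata *meta = (void *)(long)ctx->data_meta;

            if ((void *)(meta + 1) > data)
                    return XDP_PASS;        /* no metadata stashed */

            /* The BTF ID doubles as the magic value identifying the layout. */
            if (meta->magic != bpf_core_type_id_kernel(struct xdp_to_skb_metadata))
                    return XDP_PASS;

            /* __randomize_layout means offsets differ per build; with
             * vmlinux.h types, clang emits CO-RE relocations for this
             * access against the running kernel's BTF. */
            bpf_printk("rx_timestamp: %llu", meta->rx_timestamp);
            return XDP_PASS;
    }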
>> Look forward to seeing what the whole thing looks like in a more >> complete form :) > I'm sort of on-board with the kfuncs and unroll-tricks, if I can see some > driver code that handles the issue of exposing the HW setup state needed > to decode the RX-desc format. > I sense that I haven't been good enough at explaining/conveying the BTF-ID > scheme. Next week, I will code some examples that demo how BTF-IDs can be > used from BPF-progs, even as a communication channel between different > BPF-progs (e.g. drv XDP-prog -> cpumap XDP-prog -> TC-BPF). For my part at least, it's not a lack of understanding that makes me prefer the kfunc approach. Rather, it's the complexity of having to resolve the multiple BTF IDs, and the risk of ossifying the struct layouts because people are going to do that wrong. Using kfuncs gives us much more control of the API, especially if we combine it with struct randomisation for the bits we do expose. Translating what we've discussed above into the terms used in your patch series, this would correspond to *only* having the xdp_metadata_common struct exposed via BTF, and not bothering with all the other driver-specific layouts. So an XDP/AF_XDP user that only wants to use the metadata that's also used by the stack can just call xdp_copy_metadata_for_skb(), and then read the resulting metadata area (using BTF). And if someone wants to access metadata that's *not* used by the stack, they'd have to call additional kfuncs to extract that. And similarly, if someone wants only a subset of the metadata used by an SKB, they can just *not* call xdp_copy_metadata_for_skb(), and instead just call the individual kfuncs to extract just the fields they want. I think this strikes a nice balance between the flexibility of the kernel to change things, the flexibility of XDP consumers to request only the data they want, and the ability for the same metadata to be consumed at different points. WDYT? -Toke
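The two consumption models Toke contrasts, side by side; both extern kfunc declarations are assumptions based on this thread:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    extern int bpf_xdp_copy_metadata_for_skb(struct xdp_md *ctx) __ksym;
    extern __u64 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;

    /* Model 1: stash everything the stack consumes, in one call. */
    SEC("xdp")
    int want_skb_hints(struct xdp_md *ctx)
    {
            bpf_xdp_copy_metadata_for_skb(ctx);
            return XDP_PASS;
    }

    /* Model 2: extract only the single field this program cares about. */
    SEC("xdp")
    int want_one_field(struct xdp_md *ctx)
    {
            __u64 ts = bpf_xdp_metadata_rx_timestamp(ctx);

            return ts ? XDP_PASS : XDP_DROP;
    }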
On 03/11/2022 13.48, Toke Høiland-Jørgensen wrote: > Jesper Dangaard Brouer <jbrouer@redhat.com> writes: >> On 03/11/2022 01.09, Toke Høiland-Jørgensen wrote: [...]
> You're right that it should be possible to read the values back out again > later. That is totally possible with this scheme, though; the > 'xdp_to_skb_metadata' is going to be in the vmlinux BTF, so an XDP program > can just read that. We can explicitly support it by using the BTF ID as > the "magic value" as you suggest, which would be fine by me. I'm on board if we, as you suggest, add the BTF-ID as the "magic value" (as the last member, due to AF_XDP processing) when xdp_copy_metadata_for_skb() writes 'xdp_to_skb_metadata' into the metadata area. We should simply see this BTF-ID as a 'cookie' or magic number. > I still think we should be using the __randomize_layout trick, though, > precisely so that BPF consumers are forced to use BTF relocations to read > it; otherwise we risk the struct layout ossifying into UAPI because people > are just going to assume it's static... I'm also on board with some level of randomization of the struct to force consumers to use BTF for relocations; e.g. the BTF-ID cookie/magic should stay at a fixed location.
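From the AF_XDP consumer's side, placing the BTF-ID as the last member means the cookie sits at a fixed negative offset from the packet data, since the metadata area ends where the frame's packet data begins. A minimal sketch (function name hypothetical):

    #include <stdint.h>
    #include <string.h>

    /* Check the trailing cookie before decoding the rest of the
     * (BTF-described, possibly randomized) metadata layout. */
    static int frame_has_hints(const void *pkt_data, uint32_t expected_cookie)
    {
            uint32_t cookie;

            memcpy(&cookie, (const uint8_t *)pkt_data - sizeof(cookie),
                   sizeof(cookie));
            return cookie == expected_cookie;
    }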
> See above: I don't think we should prevent userspace from using it (and > we're not), but we should prevent the struct layout from ossifying. Okay, then we are in agreement: avoid ossifying the struct layout. >> The whole exercise started with wanting to provide AF_XDP with these >> HW-hints. [...] > As mentioned above, I would be totally fine with just having the > xdp_to_skb_metadata be part of BTF, enabling both XDP programs and > AF_XDP consumers to re-use it. My use-case is that AF_XDP will need to key on this runtime BTF-ID magic value anyway to read out 'xdp_to_skb_metadata' values. I have an XDP-prog running that will RX-timestamp only the time-sync protocol packets. Code-wise, I will simply add my RX-timestamp on top of struct 'xdp_to_skb_metadata' and then update the BTF-ID magic value. I don't need to add a real BTF-ID, just some magic value that my AF_XDP userspace prog knows about. In my current code[1], I'm playing nice and add the BPF-prog's own local BTF-ID via bpf_core_type_id_local(). [1] https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80 > Translating what we've discussed above into the terms used in your patch > series, this would correspond to *only* having the xdp_metadata_common > struct exposed via BTF, and not bothering with all the other > driver-specific layouts. [...] And if someone wants to access metadata > that's *not* used by the stack, they'd have to call additional kfuncs to > extract that. I agree that this patchset does/will simplify my patchset. My driver-specific structs for BTF-IDs will no longer be needed, as it is now up to the XDP-prog to explicitly extend the metadata with fields.
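The pattern Jesper describes, roughly as in the linked af_xdp_kern.c: reserve a program-local hints struct in front of the packet and stamp it with the program's own BTF ID. Struct and field names here are hypothetical:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    struct my_hints {
            __u64 rx_timestamp;
            __u32 btf_id;           /* last member: the cookie AF_XDP keys on */
    };

    struct {
            __uint(type, BPF_MAP_TYPE_XSKMAP);
            __uint(max_entries, 64);
            __type(key, __u32);
            __type(value, __u32);
    } xsks_map SEC(".maps");

    SEC("xdp")
    int stamp_hints(struct xdp_md *ctx)
    {
            struct my_hints *meta;
            void *data;

            if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                    return XDP_PASS;

            data = (void *)(long)ctx->data;
            meta = (void *)(long)ctx->data_meta;
            if ((void *)(meta + 1) > data)
                    return XDP_PASS;

            meta->rx_timestamp = bpf_ktime_get_ns(); /* stand-in for a HW stamp */
            meta->btf_id = bpf_core_type_id_local(struct my_hints);

            return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
    }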
This should reduce your concern with resolving multiple BTF IDs. Maybe after this patchset, I would suggest that we create some "kernel-central" structs that contain e.g. the RX-timestamp, the mark (mlx5 has HW support for mark), and the protocol types (via RX-hash). These could be used by XDP-progs that explicitly extract these fields and communicate the layout by setting the BTF-ID (via calling bpf_core_type_id_kernel()), making it easier to consume from chained BPF-progs, AF_XDP, and even kernel C-code that updates SKB fields, as the number of these magic BTF-IDs will be small enough. > And similarly, if someone wants only a subset of the metadata used by an > SKB, they can just *not* call xdp_copy_metadata_for_skb(), and instead > just call the individual kfuncs to extract just the fields they want. > I think this strikes a nice balance between the flexibility of the kernel > to change things, the flexibility of XDP consumers to request only the > data they want, and the ability for the same metadata to be consumed at > different points. WDYT? With the above comments, I think we are closer to an agreement again. --Jesper
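The "kernel-central" variant would differ from the previous sketch only in where the cookie comes from; a fragment under the assumption that a struct like Jesper's xdp_hints_rx_common exists in the running kernel's BTF:

    /* Program-local layout: cookie from the prog's own BTF. */
    meta->btf_id = bpf_core_type_id_local(struct my_hints);

    /* Kernel-central layout: cookie from vmlinux BTF, so chained
     * BPF-progs, AF_XDP consumers, and kernel C-code can all key on
     * the same value. */
    meta->btf_id = bpf_core_type_id_kernel(struct xdp_hints_rx_common);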