Message ID: 20250307092356.638242-1-linyunsheng@huawei.com (mailing list archive)
Series: fix the DMA API misuse problem for page_pool
Yunsheng Lin <linyunsheng@huawei.com> writes:

> This patchset fixes the DMA API misuse problem described below:
> Networking drivers with page_pool support may hand over pages
> still holding a DMA mapping to the network stack, and try to reuse
> those pages after the network stack is done with them and passes
> them back to the page_pool, to avoid the penalty of DMA
> mapping/unmapping. With all the caching in the network stack, some
> pages may be held in the network stack without returning to the
> page_pool soon enough, and with a VF disable causing the driver to
> unbind, the page_pool does not stop the driver from doing its
> unbinding work. Instead, page_pool uses a workqueue to periodically
> check whether any pages have come back from the network stack, and
> if so, it does the DMA-unmapping-related cleanup work.
>
> As mentioned in [1], attempting DMA unmaps after the driver
> has already unbound may leak resources or at worst corrupt
> memory. Fundamentally, the page pool code cannot allow DMA
> mappings to outlive the driver they belong to.
>
> By using the 'struct page_pool_item' referenced by page->pp_item,
> page_pool is not only able to keep track of inflight pages in
> order to do the DMA unmapping if some pages are still held by the
> networking stack when page_pool_destroy() is called, but the
> networking stack is also able to find the page_pool owning a page
> when returning pages back to the page_pool:
> 1. When a page is added to the page_pool, an item is taken from
>    pool->hold_items, and item->state and item->pp_netmem are set
>    accordingly in order to keep track of that page; pool->hold_items
>    is refilled from pool->release_items when it is empty, and an
>    item from pool->slow_items is used when the fast items run out.
> 2. When a page is released from the page_pool, it is possible to
>    tell which page_pool the page belongs to by masking off the lower
>    bits of the 'struct page_pool_item *item' pointer, as the 'struct
>    page_pool_item_block' is stored at the top of a struct page. And
>    after clearing item->state, the item for the released page is
>    added back to pool->release_items so that it can be reused for
>    new pages, or simply freed when it came from pool->slow_items.
> 3. When page_pool_destroy() is called, item->state is used to tell
>    whether a specific item is in use/DMA-mapped by scanning all the
>    item blocks in pool->item_blocks; item->netmem can then be used
>    to do the DMA unmapping for any corresponding inflight page that
>    is DMA-mapped.

You are making this incredibly complicated. You've basically implemented
a whole new slab allocator for those page_pool_item objects, and you're
tracking every page handed out by the page pool instead of just the ones
that are DMA-mapped. None of this is needed.

I took a stab at implementing the xarray-based tracking first suggested
by Mina[0]:

https://git.kernel.org/toke/c/e87e0edf9520

And, well, it's 50 lines of extra code, none of which are in the fast
path.

Jesper has kindly helped with testing that it works for normal packet
processing, but I haven't yet verified that it resolves the original
crash. Will post the patch to the list once I have verified this (help
welcome!).

-Toke

[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
On 3/7/2025 10:15 PM, Toke Høiland-Jørgensen wrote:

...

> You are making this incredibly complicated. You've basically implemented
> a whole new slab allocator for those page_pool_item objects, and you're
> tracking every page handed out by the page pool instead of just the ones
> that are DMA-mapped. None of this is needed.
>
> I took a stab at implementing the xarray-based tracking first suggested
> by Mina[0]:

I did discuss Mina's suggestion with Ilias below, in case you didn't
notice:
https://lore.kernel.org/all/0ef315df-e8e9-41e8-9ba8-dcb69492c616@huawei.com/

Anyway, it is great that you took the effort to actually implement the
idea so we have a more concrete comparison here.

> https://git.kernel.org/toke/c/e87e0edf9520
>
> And, well, it's 50 lines of extra code, none of which are in the fast
> path.

I wonder what the overhead of the xarray idea is for the
time_bench_page_pool03_slow() testcase, before we begin to discuss
whether the xarray idea is indeed feasible.

> Jesper has kindly helped with testing that it works for normal packet
> processing, but I haven't yet verified that it resolves the original
> crash. Will post the patch to the list once I have verified this (help
> welcome!).

An RFC seems like a good way to show and discuss the basic idea.

I only took a glance at the git code above, but reusing _pp_mapping_pad
for pp_dma_index seems like the wrong direction, as mentioned in the
discussion with Ilias above: that field may be used when a page is
mmap'ed to user space, and reusing it in 'struct page' seems to disable
the TCP zero-copy feature. See the below commit from Eric:
https://github.com/torvalds/linux/commit/577e4432f3ac810049cb7e6b71f4d96ec7c6e894

Also, I am not sure whether a page_pool-owned page can be spliced into
the fs subsystem yet, but if it can, I am not sure how reusing
page->mapping is possible if that page ends up in __filemap_add_folio():

https://elixir.bootlin.com/linux/v6.14-rc5/source/mm/filemap.c#L882

> -Toke
>
> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
Yunsheng Lin <yunshenglin0825@gmail.com> writes:

> On 3/7/2025 10:15 PM, Toke Høiland-Jørgensen wrote:
>
> ...
>
>> You are making this incredibly complicated. You've basically implemented
>> a whole new slab allocator for those page_pool_item objects, and you're
>> tracking every page handed out by the page pool instead of just the ones
>> that are DMA-mapped. None of this is needed.
>>
>> I took a stab at implementing the xarray-based tracking first suggested
>> by Mina[0]:
>
> I did discuss Mina's suggestion with Ilias below, in case you didn't
> notice:
> https://lore.kernel.org/all/0ef315df-e8e9-41e8-9ba8-dcb69492c616@huawei.com/

I didn't; thanks for the pointer. See below.

> Anyway, it is great that you took the effort to actually implement the
> idea so we have a more concrete comparison here.

:)

>> https://git.kernel.org/toke/c/e87e0edf9520
>>
>> And, well, it's 50 lines of extra code, none of which are in the fast
>> path.
>
> I wonder what the overhead of the xarray idea is for the
> time_bench_page_pool03_slow() testcase, before we begin to discuss
> whether the xarray idea is indeed feasible.

Well, just running that benchmark shows no impact:

|                               | Baseline |        | xarray |        |
|                               |   Cycles |     ns | Cycles |     ns |
|-------------------------------+----------+--------+--------+--------|
| no-softirq-page_pool01        |       20 |  5.713 |     19 |  5.516 |
| no-softirq-page_pool02        |       56 | 15.560 |     57 | 15.864 |
| no-softirq-page_pool03        |      225 | 62.763 |    222 | 61.728 |
| tasklet_page_pool01_fast_path |       19 |  5.399 |     19 |  5.505 |
| tasklet_page_pool02_ptr_ring  |       54 | 15.090 |     54 | 15.018 |
| tasklet_page_pool03_slow      |      238 | 66.134 |    239 | 66.498 |

...however, the benchmark doesn't actually do any DMA mapping, so it's
not super surprising that it doesn't show any difference: it's not
exercising any of the xarray code. Your series shows a difference on
this benchmark only because it does the page_pool_item allocation
regardless of whether DMA is used or not.

I guess we should try to come up with a micro-benchmark that does
exercise the DMA code. Or just hack up the xarray patch to do the
tracking regardless, for benchmarking purposes.

>> Jesper has kindly helped with testing that it works for normal packet
>> processing, but I haven't yet verified that it resolves the original
>> crash. Will post the patch to the list once I have verified this (help
>> welcome!).
>
> An RFC seems like a good way to show and discuss the basic idea.

Sure, I can send it as an RFC straight away if you prefer. Note that I'm
on my way to netdevconf, though, so will probably have limited time to
pay attention to this for the next week or so.

> I only took a glance at the git code above, but reusing _pp_mapping_pad
> for pp_dma_index seems like the wrong direction, as mentioned in the
> discussion with Ilias above: that field may be used when a page is
> mmap'ed to user space, and reusing it in 'struct page' seems to disable
> the TCP zero-copy feature. See the below commit from Eric:
> https://github.com/torvalds/linux/commit/577e4432f3ac810049cb7e6b71f4d96ec7c6e894
>
> Also, I am not sure whether a page_pool-owned page can be spliced into
> the fs subsystem yet, but if it can, I am not sure how reusing
> page->mapping is possible if that page ends up in __filemap_add_folio():
>
> https://elixir.bootlin.com/linux/v6.14-rc5/source/mm/filemap.c#L882

Hmm, so I did look at the mapping field, but concluded using it wouldn't
interfere with anything relevant as long as it's reset back to zero
before the page is returned to the page allocator. However, I definitely
missed the TCP zero-copy thing, and other things as well, it would seem
(cf the discussion you referred to above).

However, I did consider alternatives: AFAICT there should be space in
the pp_magic field (used for the PP_SIGNATURE), so that with a bit of
care we can stick an ID into the upper bits and still avoid ending up
with a value that could look like a valid pointer. I didn't implement
that initially because I wasn't sure it was necessary, but seeing as it
is, I will take another look at it. I have one or two other ideas if
this turns out not to pan out.

-Toke