Message ID | 77a5d8c-406b-7068-4f17-23b7ac53bc83@google.com (mailing list archive) |
---|---|
Headers | show |
Series | arch: allow pte_offset_map[_lock]() to fail | expand |
On Tue, May 09, 2023 at 09:39:13PM -0700, Hugh Dickins wrote: > Two: pte_offset_map() will need to do an rcu_read_lock(), with the > corresponding rcu_read_unlock() in pte_unmap(). But most architectures > never supported CONFIG_HIGHPTE, so some don't always call pte_unmap() > after pte_offset_map(), or have used userspace pte_offset_map() where > pte_offset_kernel() is more correct. No problem in the current tree, > but a problem once an rcu_read_unlock() will be needed to keep balance. Hi Hugh, I shall have to spend some time looking at these patches, but at LSFMM just a few hours ago, I proposed and nobody objected to removing CONFIG_HIGHPTE. I don't intend to take action on that consensus immediately, so I can certainly wait until your patches are applied, but if this information simplifies what you're doing, feel free to act on it.
On Wed, 10 May 2023, Matthew Wilcox wrote: > On Tue, May 09, 2023 at 09:39:13PM -0700, Hugh Dickins wrote: > > Two: pte_offset_map() will need to do an rcu_read_lock(), with the > > corresponding rcu_read_unlock() in pte_unmap(). But most architectures > > never supported CONFIG_HIGHPTE, so some don't always call pte_unmap() > > after pte_offset_map(), or have used userspace pte_offset_map() where > > pte_offset_kernel() is more correct. No problem in the current tree, > > but a problem once an rcu_read_unlock() will be needed to keep balance. > > Hi Hugh, > > I shall have to spend some time looking at these patches, but at LSFMM > just a few hours ago, I proposed and nobody objected to removing > CONFIG_HIGHPTE. I don't intend to take action on that consensus > immediately, so I can certainly wait until your patches are applied, but > if this information simplifies what you're doing, feel free to act on it. Thanks a lot, Matthew: very considerate, as usual. Yes, I did see your "Whither Highmem?" (wither highmem!) proposal on the list, and it did make me think, better get these patches and preview out soon, before you get to vanish pte_unmap() altogether. HIGHMEM or not, HIGHPTE or not, I think pte_offset_map() and pte_unmap() still have an important role to play. I don't really understand why you're going down a remove-CONFIG_HIGHPTE route: I thought you were motivated by the awkardness of kmap on large folios; but I don't see how removing HIGHPTE helps with that at all (unless you have a "large page tables" effort in mind, but I doubt it). But I've no investment in CONFIG_HIGHPTE if people think now is the time to remove it: I disagree, but wouldn't miss it myself - so long as you leave pte_offset_map() and pte_unmap() (under whatever names). I don't think removing CONFIG_HIGHPTE will simplify what I'm doing. For a moment it looked like it would: the PAE case is nasty (and our data centres have not been on PAE for a long time, so it wasn't a problem I had to face before); and knowing pmd_high must be 0 for a page table looked like it would help, but now I'm not so sure of that (hmm, I'm changing my mind again as I write). Peter's pmdp_get_lockless() does rely for complete correctness on interrupts being disabled, and I suspect that I may be forced in the PAE case to do so briefly; but detest that notion. For now I'm just deferring it, hoping for a better idea before third series finalized. I mention this (and Cc Peter) in passing: don't want this arch thread to go down into that rabbit hole: we can start a fresh thread on it if you wish, but right now my priority is commit messages for the second series, rather than solving (or even detailing) the PAE problem. Hugh
On Wed, May 10, 2023 at 09:35:44PM -0700, Hugh Dickins wrote: > On Wed, 10 May 2023, Matthew Wilcox wrote: > > On Tue, May 09, 2023 at 09:39:13PM -0700, Hugh Dickins wrote: > > > Two: pte_offset_map() will need to do an rcu_read_lock(), with the > > > corresponding rcu_read_unlock() in pte_unmap(). But most architectures > > > never supported CONFIG_HIGHPTE, so some don't always call pte_unmap() > > > after pte_offset_map(), or have used userspace pte_offset_map() where > > > pte_offset_kernel() is more correct. No problem in the current tree, > > > but a problem once an rcu_read_unlock() will be needed to keep balance. > > > > Hi Hugh, > > > > I shall have to spend some time looking at these patches, but at LSFMM > > just a few hours ago, I proposed and nobody objected to removing > > CONFIG_HIGHPTE. I don't intend to take action on that consensus > > immediately, so I can certainly wait until your patches are applied, but > > if this information simplifies what you're doing, feel free to act on it. > > Thanks a lot, Matthew: very considerate, as usual. > > Yes, I did see your "Whither Highmem?" (wither highmem!) proposal on the I'm glad somebody noticed the pun ;-) > list, and it did make me think, better get these patches and preview out > soon, before you get to vanish pte_unmap() altogether. HIGHMEM or not, > HIGHPTE or not, I think pte_offset_map() and pte_unmap() still have an > important role to play. > > I don't really understand why you're going down a remove-CONFIG_HIGHPTE > route: I thought you were motivated by the awkardness of kmap on large > folios; but I don't see how removing HIGHPTE helps with that at all > (unless you have a "large page tables" effort in mind, but I doubt it). Quite right, my primary concern is filesystem metadata; primarily directories as I don't think anybody has ever supported symlinks or superblocks larger than 4kB. I was thinking that removing CONFIG_HIGHPTE might simplify the page fault handling path a little, but now I've looked at it some more, and I'm not sure there's any simplification to be had. It should probably use kmap_local instead of kmap_atomic(), though. > But I've no investment in CONFIG_HIGHPTE if people think now is the > time to remove it: I disagree, but wouldn't miss it myself - so long > as you leave pte_offset_map() and pte_unmap() (under whatever names). > > I don't think removing CONFIG_HIGHPTE will simplify what I'm doing. > For a moment it looked like it would: the PAE case is nasty (and our > data centres have not been on PAE for a long time, so it wasn't a > problem I had to face before); and knowing pmd_high must be 0 for a > page table looked like it would help, but now I'm not so sure of that > (hmm, I'm changing my mind again as I write). > > Peter's pmdp_get_lockless() does rely for complete correctness on > interrupts being disabled, and I suspect that I may be forced in the > PAE case to do so briefly; but detest that notion. For now I'm just > deferring it, hoping for a better idea before third series finalized. > > I mention this (and Cc Peter) in passing: don't want this arch thread > to go down into that rabbit hole: we can start a fresh thread on it if > you wish, but right now my priority is commit messages for the second > series, rather than solving (or even detailing) the PAE problem. I infer that what you need is a pte_access_start() and a pte_access_end() which look like they can be plausibly rcu_read_lock() and rcu_read_unlock(), but might need to be local_irq_save() and local_irq_restore() in some configurations? We also talked about moving x86 to always RCU-free page tables in order to make accessing /proc/$pid/smaps lockless. I believe Michel is going to take a swing at this project.
On Thu, 11 May 2023, Matthew Wilcox wrote: > > I was thinking that removing CONFIG_HIGHPTE might simplify the page > fault handling path a little, but now I've looked at it some more, and > I'm not sure there's any simplification to be had. It should probably > use kmap_local instead of kmap_atomic(), though. Re kmap_local, yes, one of the patches in the next series does make that change. > > I infer that what you need is a pte_access_start() and a > pte_access_end() which look like they can be plausibly rcu_read_lock() > and rcu_read_unlock(), but might need to be local_irq_save() and > local_irq_restore() in some configurations? Yes, except that the local_irq_restore() in PAE-like configurations (if we need it at all) is not delayed until the pte_access_end() or pte_unmap() - it's internal to the pte_access_start() or pte_offset_map(): interrupts only disabled across the getting of a consistent pmd entry. Over-generalizing a little, any user of pte_offset_map() (as opposed to pte_offset_map_lock()) has to be prepared for the ptes to change under them: but we do need to give them something that is or was recently the relevant page table, rather than a random page mishmashed from mismatched pmd_low and pmd_high. > > We also talked about moving x86 to always RCU-free page tables in > order to make accessing /proc/$pid/smaps lockless. I believe Michel > is going to take a swing at this project. (And /proc/$pid/numa_maps, I hope: that's even worse in some way, IIRC.) That might be orthogonal to what I'm doing: many non-x86 architectures already do RCU-freeing of page tables via the TLB route, but that doesn't cover a pte_free() from retract_page_tables() or collapse_and_free_pmd(). Hugh
Hi, On Thu, May 11, 2023 at 03:02:55PM +0100, Matthew Wilcox wrote: > On Wed, May 10, 2023 at 09:35:44PM -0700, Hugh Dickins wrote: > > On Wed, 10 May 2023, Matthew Wilcox wrote: > > > > I don't really understand why you're going down a remove-CONFIG_HIGHPTE > > route: I thought you were motivated by the awkardness of kmap on large > > folios; but I don't see how removing HIGHPTE helps with that at all > > (unless you have a "large page tables" effort in mind, but I doubt it). > > Quite right, my primary concern is filesystem metadata; primarily > directories as I don't think anybody has ever supported symlinks or > superblocks larger than 4kB. > > I was thinking that removing CONFIG_HIGHPTE might simplify the page > fault handling path a little, but now I've looked at it some more, and > I'm not sure there's any simplification to be had. It should probably > use kmap_local instead of kmap_atomic(), though. Removing CONFIG_HIGHPTE will drop several lines and will allow to get rid of custom __pte_alloc_one on x86. -- Sincerely yours, Mike.
On Thu, May 11, 2023 at 03:02:55PM +0100, Matthew Wilcox wrote: > We also talked about moving x86 to always RCU-free page tables in > order to make accessing /proc/$pid/smaps lockless. I believe Michel > is going to take a swing at this project. Shouldn't be too controversial I think -- effectively everybody already has it enabled because everybody builds with KVM enabled.