Message ID | 20240215121756.2734131-1-ryan.roberts@arm.com (mailing list archive) |
---|---|
Headers | show |
Series | Reduce cost of ptep_get_lockless on arm64 | expand |
On 15.02.24 13:17, Ryan Roberts wrote: > This is an RFC for a series that aims to reduce the cost and complexity of > ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1]. > The approach came from discussion with Mark and David [2]. > > It introduces a new helper, ptep_get_lockless_norecency(), which allows the > access and dirty bits in the returned pte to be incorrect. This relaxation > permits arm64's implementation to just read the single target pte, and avoids > having to iterate over the full contpte block to gather the access and dirty > bits, for the contpte case. > > It turns out that none of the call sites using ptep_get_lockless() require > accurate access and dirty bit information, so we can also convert those sites. > Although a couple of places need care (see patches 2 and 3). > > Arguably patch 3 is a bit fragile, given the wide accessibility of > vmf->orig_pte. So it might make sense to drop this patch and stick to using > ptep_get_lockless() in the page fault path. I'm keen to hear opinions. Yes. Especially as we have these pte_same() checks that might just fail now because of wrong accessed/dirty bits? Likely, we just want to read "the real deal" on both sides of the pte_same() handling. > > I've chosen the name "recency" because it's shortish and somewhat descriptive, > and is alredy used in a couple of places to mean similar things (see mglru and > damon). I'm open to other names if anyone has better ideas. Not a native speaker; works for me. > > If concensus is that this approach is generally acceptable, I intend to create a > series in future to do a similar thing with ptep_get() -> ptep_get_norecency(). Yes, sounds good to me.
On 26/03/2024 16:17, David Hildenbrand wrote: > On 15.02.24 13:17, Ryan Roberts wrote: >> This is an RFC for a series that aims to reduce the cost and complexity of >> ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1]. >> The approach came from discussion with Mark and David [2]. >> >> It introduces a new helper, ptep_get_lockless_norecency(), which allows the >> access and dirty bits in the returned pte to be incorrect. This relaxation >> permits arm64's implementation to just read the single target pte, and avoids >> having to iterate over the full contpte block to gather the access and dirty >> bits, for the contpte case. >> >> It turns out that none of the call sites using ptep_get_lockless() require >> accurate access and dirty bit information, so we can also convert those sites. >> Although a couple of places need care (see patches 2 and 3). >> >> Arguably patch 3 is a bit fragile, given the wide accessibility of >> vmf->orig_pte. So it might make sense to drop this patch and stick to using >> ptep_get_lockless() in the page fault path. I'm keen to hear opinions. > > Yes. Especially as we have these pte_same() checks that might just fail now > because of wrong accessed/dirty bits? Which pte_same() checks are you referring to? I've changed them all to pte_same_norecency() which ignores the access/dirty bits when doing the comparison. > > Likely, we just want to read "the real deal" on both sides of the pte_same() > handling. Sorry I'm not sure I understand? You mean read the full pte including access/dirty? That's the same as dropping the patch, right? Of course if we do that, we still have to keep pte_get_lockless() around for this case. In an ideal world we would convert everything over to ptep_get_lockless_norecency() and delete ptep_get_lockless() to remove the ugliness from arm64. > >> >> I've chosen the name "recency" because it's shortish and somewhat descriptive, >> and is alredy used in a couple of places to mean similar things (see mglru and >> damon). I'm open to other names if anyone has better ideas. > > Not a native speaker; works for me. > >> >> If concensus is that this approach is generally acceptable, I intend to create a >> series in future to do a similar thing with ptep_get() -> ptep_get_norecency(). > > Yes, sounds good to me. >
On 26/03/2024 16:34, David Hildenbrand wrote: > On 26.03.24 17:31, Ryan Roberts wrote: >> On 26/03/2024 16:17, David Hildenbrand wrote: >>> On 15.02.24 13:17, Ryan Roberts wrote: >>>> This is an RFC for a series that aims to reduce the cost and complexity of >>>> ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1]. >>>> The approach came from discussion with Mark and David [2]. >>>> >>>> It introduces a new helper, ptep_get_lockless_norecency(), which allows the >>>> access and dirty bits in the returned pte to be incorrect. This relaxation >>>> permits arm64's implementation to just read the single target pte, and avoids >>>> having to iterate over the full contpte block to gather the access and dirty >>>> bits, for the contpte case. >>>> >>>> It turns out that none of the call sites using ptep_get_lockless() require >>>> accurate access and dirty bit information, so we can also convert those sites. >>>> Although a couple of places need care (see patches 2 and 3). >>>> >>>> Arguably patch 3 is a bit fragile, given the wide accessibility of >>>> vmf->orig_pte. So it might make sense to drop this patch and stick to using >>>> ptep_get_lockless() in the page fault path. I'm keen to hear opinions. >>> >>> Yes. Especially as we have these pte_same() checks that might just fail now >>> because of wrong accessed/dirty bits? >> >> Which pte_same() checks are you referring to? I've changed them all to >> pte_same_norecency() which ignores the access/dirty bits when doing the >> comparison. > > I'm reading the patches just now. So I stumbled over that just after I wrote > that, so I was missing that part from the description here. > >> >>> >>> Likely, we just want to read "the real deal" on both sides of the pte_same() >>> handling. >> >> Sorry I'm not sure I understand? You mean read the full pte including >> access/dirty? That's the same as dropping the patch, right? Of course if we do >> that, we still have to keep pte_get_lockless() around for this case. In an ideal >> world we would convert everything over to ptep_get_lockless_norecency() and >> delete ptep_get_lockless() to remove the ugliness from arm64. > > Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any > architecture. > > I do wonder if pte_same_norecency() should be defined per architecture and the > default would be pte_same(). So we could avoid the mkold etc on all other > architectures. Wouldn't that break it's semantics? The "norecency" of ptep_get_lockless_norecency() means "recency information in the returned pte may be incorrect". But the "norecency" of pte_same_norecency() means "ignore the access and dirty bits when you do the comparison". I think you could only do the optimization you describe if you required that pte_same_norecency() would only be given values returned by ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, its not quite the same; if a page is accessed between gets one will return true and the other false.
>>>> >>>> Likely, we just want to read "the real deal" on both sides of the pte_same() >>>> handling. >>> >>> Sorry I'm not sure I understand? You mean read the full pte including >>> access/dirty? That's the same as dropping the patch, right? Of course if we do >>> that, we still have to keep pte_get_lockless() around for this case. In an ideal >>> world we would convert everything over to ptep_get_lockless_norecency() and >>> delete ptep_get_lockless() to remove the ugliness from arm64. >> >> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any >> architecture. >> >> I do wonder if pte_same_norecency() should be defined per architecture and the >> default would be pte_same(). So we could avoid the mkold etc on all other >> architectures. > > Wouldn't that break it's semantics? The "norecency" of > ptep_get_lockless_norecency() means "recency information in the returned pte may > be incorrect". But the "norecency" of pte_same_norecency() means "ignore the > access and dirty bits when you do the comparison". My idea was that ptep_get_lockless_norecency() would return the actual result on these architectures. So e.g., on x86, there would be no actual change in generated code. But yes, the documentation of these functions would have to be improved. Now I wonder if ptep_get_lockless_norecency() should actively clear dirty/accessed bits to more easily find any actual issues where the bits still matter ... > > I think you could only do the optimization you describe if you required that > pte_same_norecency() would only be given values returned by > ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, its not > quite the same; if a page is accessed between gets one will return true and the > other false. Right.
On 26/03/2024 17:04, David Hildenbrand wrote: >>>>> >>>>> Likely, we just want to read "the real deal" on both sides of the pte_same() >>>>> handling. >>>> >>>> Sorry I'm not sure I understand? You mean read the full pte including >>>> access/dirty? That's the same as dropping the patch, right? Of course if we do >>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>> ideal >>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>> >>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any >>> architecture. >>> >>> I do wonder if pte_same_norecency() should be defined per architecture and the >>> default would be pte_same(). So we could avoid the mkold etc on all other >>> architectures. >> >> Wouldn't that break it's semantics? The "norecency" of >> ptep_get_lockless_norecency() means "recency information in the returned pte may >> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >> access and dirty bits when you do the comparison". > > My idea was that ptep_get_lockless_norecency() would return the actual result on > these architectures. So e.g., on x86, there would be no actual change in > generated code. I think this is a bad plan... You'll end up with subtle differences between architectures. > > But yes, the documentation of these functions would have to be improved. > > Now I wonder if ptep_get_lockless_norecency() should actively clear > dirty/accessed bits to more easily find any actual issues where the bits still > matter ... I did a version that took that approach. Decided it was not as good as this way though. Now for the life of me, I can't remember my reasoning. > >> >> I think you could only do the optimization you describe if you required that >> pte_same_norecency() would only be given values returned by >> ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, its not >> quite the same; if a page is accessed between gets one will return true and the >> other false. > > Right. >
On 26.03.24 18:32, Ryan Roberts wrote: > On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>> >>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same() >>>>>> handling. >>>>> >>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>> access/dirty? That's the same as dropping the patch, right? Of course if we do >>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>> ideal >>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>> >>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any >>>> architecture. >>>> >>>> I do wonder if pte_same_norecency() should be defined per architecture and the >>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>> architectures. >>> >>> Wouldn't that break it's semantics? The "norecency" of >>> ptep_get_lockless_norecency() means "recency information in the returned pte may >>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>> access and dirty bits when you do the comparison". >> >> My idea was that ptep_get_lockless_norecency() would return the actual result on >> these architectures. So e.g., on x86, there would be no actual change in >> generated code. > > I think this is a bad plan... You'll end up with subtle differences between > architectures. > >> >> But yes, the documentation of these functions would have to be improved. >> >> Now I wonder if ptep_get_lockless_norecency() should actively clear >> dirty/accessed bits to more easily find any actual issues where the bits still >> matter ... > > I did a version that took that approach. Decided it was not as good as this way > though. Now for the life of me, I can't remember my reasoning. Maybe because there are some code paths that check accessed/dirty without "correctness" implications? For example, if the PTE is already dirty, no need to set it dirty etc?
On 26/03/2024 17:39, David Hildenbrand wrote: > On 26.03.24 18:32, Ryan Roberts wrote: >> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>> >>>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same() >>>>>>> handling. >>>>>> >>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>> we do >>>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>>> ideal >>>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>> >>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect >>>>> any >>>>> architecture. >>>>> >>>>> I do wonder if pte_same_norecency() should be defined per architecture and the >>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>> architectures. >>>> >>>> Wouldn't that break it's semantics? The "norecency" of >>>> ptep_get_lockless_norecency() means "recency information in the returned pte >>>> may >>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>>> access and dirty bits when you do the comparison". >>> >>> My idea was that ptep_get_lockless_norecency() would return the actual result on >>> these architectures. So e.g., on x86, there would be no actual change in >>> generated code. >> >> I think this is a bad plan... You'll end up with subtle differences between >> architectures. >> >>> >>> But yes, the documentation of these functions would have to be improved. >>> >>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>> dirty/accessed bits to more easily find any actual issues where the bits still >>> matter ... >> >> I did a version that took that approach. Decided it was not as good as this way >> though. Now for the life of me, I can't remember my reasoning. > > Maybe because there are some code paths that check accessed/dirty without > "correctness" implications? For example, if the PTE is already dirty, no need to > set it dirty etc? I think I decided I was penalizing the architectures that don't care because all their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so that I could feed its result into pte_same().
On 26.03.24 18:51, Ryan Roberts wrote: > On 26/03/2024 17:39, David Hildenbrand wrote: >> On 26.03.24 18:32, Ryan Roberts wrote: >>> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>>> >>>>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same() >>>>>>>> handling. >>>>>>> >>>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>>> we do >>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>>>> ideal >>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>>> >>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect >>>>>> any >>>>>> architecture. >>>>>> >>>>>> I do wonder if pte_same_norecency() should be defined per architecture and the >>>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>>> architectures. >>>>> >>>>> Wouldn't that break it's semantics? The "norecency" of >>>>> ptep_get_lockless_norecency() means "recency information in the returned pte >>>>> may >>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>>>> access and dirty bits when you do the comparison". >>>> >>>> My idea was that ptep_get_lockless_norecency() would return the actual result on >>>> these architectures. So e.g., on x86, there would be no actual change in >>>> generated code. >>> >>> I think this is a bad plan... You'll end up with subtle differences between >>> architectures. >>> >>>> >>>> But yes, the documentation of these functions would have to be improved. >>>> >>>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>>> dirty/accessed bits to more easily find any actual issues where the bits still >>>> matter ... >>> >>> I did a version that took that approach. Decided it was not as good as this way >>> though. Now for the life of me, I can't remember my reasoning. >> >> Maybe because there are some code paths that check accessed/dirty without >> "correctness" implications? For example, if the PTE is already dirty, no need to >> set it dirty etc? > > I think I decided I was penalizing the architectures that don't care because all > their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly > clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so > that I could feed its result into pte_same(). True. With ptep_get_norecency() you're also penalizing other architectures. Therefore my original thought about making the behavior arch-specific, but the arch has to make sure to get the combination of ptep_get_lockless_norecency()+ptep_same_norecency() is right. So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must make sure to also ignore them in ptep_same_norecency(), and must be able to handle access/dirty bit changes differently. Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed accessed or dirty". Only the former would end up ignoring stuff here, the latter would not. But again, just some random thoughts how this affects other architectures and how we could avoid it. The issue I describe in patch #3 would be gone if ptep_same_norecency() would just do a ptep_same() check on other architectures -- and would make it easier to sell :)
On 27/03/2024 09:34, David Hildenbrand wrote: > On 26.03.24 18:51, Ryan Roberts wrote: >> On 26/03/2024 17:39, David Hildenbrand wrote: >>> On 26.03.24 18:32, Ryan Roberts wrote: >>>> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>>>> >>>>>>>>> Likely, we just want to read "the real deal" on both sides of the >>>>>>>>> pte_same() >>>>>>>>> handling. >>>>>>>> >>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>>>> we do >>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>>>>> ideal >>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>>>> >>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect >>>>>>> any >>>>>>> architecture. >>>>>>> >>>>>>> I do wonder if pte_same_norecency() should be defined per architecture >>>>>>> and the >>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>>>> architectures. >>>>>> >>>>>> Wouldn't that break it's semantics? The "norecency" of >>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte >>>>>> may >>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>>>>> access and dirty bits when you do the comparison". >>>>> >>>>> My idea was that ptep_get_lockless_norecency() would return the actual >>>>> result on >>>>> these architectures. So e.g., on x86, there would be no actual change in >>>>> generated code. >>>> >>>> I think this is a bad plan... You'll end up with subtle differences between >>>> architectures. >>>> >>>>> >>>>> But yes, the documentation of these functions would have to be improved. >>>>> >>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>>>> dirty/accessed bits to more easily find any actual issues where the bits still >>>>> matter ... >>>> >>>> I did a version that took that approach. Decided it was not as good as this way >>>> though. Now for the life of me, I can't remember my reasoning. >>> >>> Maybe because there are some code paths that check accessed/dirty without >>> "correctness" implications? For example, if the PTE is already dirty, no need to >>> set it dirty etc? >> >> I think I decided I was penalizing the architectures that don't care because all >> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly >> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so >> that I could feed its result into pte_same(). > > True. With ptep_get_norecency() you're also penalizing other architectures. > Therefore my original thought about making the behavior arch-specific, but the > arch has to make sure to get the combination of > ptep_get_lockless_norecency()+ptep_same_norecency() is right. > > So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must > make sure to also ignore them in ptep_same_norecency(), and must be able to > handle access/dirty bit changes differently. > > Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed > accessed or dirty". Only the former would end up ignoring stuff here, the latter > would not. > > But again, just some random thoughts how this affects other architectures and > how we could avoid it. The issue I describe in patch #3 would be gone if > ptep_same_norecency() would just do a ptep_same() check on other architectures > -- and would make it easier to sell :) Perhaps - let me chew on that for a bit. It doesn't feel as easy as you suggest to me. But I can't put my finger on why...
On 27/03/2024 09:34, David Hildenbrand wrote: > On 26.03.24 18:51, Ryan Roberts wrote: >> On 26/03/2024 17:39, David Hildenbrand wrote: >>> On 26.03.24 18:32, Ryan Roberts wrote: >>>> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>>>> >>>>>>>>> Likely, we just want to read "the real deal" on both sides of the >>>>>>>>> pte_same() >>>>>>>>> handling. >>>>>>>> >>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>>>> we do >>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>>>>> ideal >>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>>>> >>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect >>>>>>> any >>>>>>> architecture. >>>>>>> >>>>>>> I do wonder if pte_same_norecency() should be defined per architecture >>>>>>> and the >>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>>>> architectures. >>>>>> >>>>>> Wouldn't that break it's semantics? The "norecency" of >>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte >>>>>> may >>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>>>>> access and dirty bits when you do the comparison". >>>>> >>>>> My idea was that ptep_get_lockless_norecency() would return the actual >>>>> result on >>>>> these architectures. So e.g., on x86, there would be no actual change in >>>>> generated code. >>>> >>>> I think this is a bad plan... You'll end up with subtle differences between >>>> architectures. >>>> >>>>> >>>>> But yes, the documentation of these functions would have to be improved. >>>>> >>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>>>> dirty/accessed bits to more easily find any actual issues where the bits still >>>>> matter ... >>>> >>>> I did a version that took that approach. Decided it was not as good as this way >>>> though. Now for the life of me, I can't remember my reasoning. >>> >>> Maybe because there are some code paths that check accessed/dirty without >>> "correctness" implications? For example, if the PTE is already dirty, no need to >>> set it dirty etc? >> >> I think I decided I was penalizing the architectures that don't care because all >> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly >> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so >> that I could feed its result into pte_same(). > > True. With ptep_get_norecency() you're also penalizing other architectures. > Therefore my original thought about making the behavior arch-specific, but the > arch has to make sure to get the combination of > ptep_get_lockless_norecency()+ptep_same_norecency() is right. > > So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must > make sure to also ignore them in ptep_same_norecency(), and must be able to > handle access/dirty bit changes differently. > > Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed > accessed or dirty". Only the former would end up ignoring stuff here, the latter > would not. > > But again, just some random thoughts how this affects other architectures and > how we could avoid it. The issue I describe in patch #3 would be gone if > ptep_same_norecency() would just do a ptep_same() check on other architectures > -- and would make it easier to sell :) > I've been thinking some more about this. I think your proposal is the following: // ARM64 ptep_get_lockless_norecency() { - returned access/dirty may be incorrect - returned access/dirty may be differently incorrect between 2 calls } pte_same_norecency() { - ignore access/dirty when doing comparison } ptep_set_access_flags(ptep, pte) { - must not assume access/dirty in pte are "more permissive" than access/dirty in *ptep - must only set access/dirty in *ptep, never clear } // Other arches: no change to generated code ptep_get_lockless_norecency() { return ptep_get_lockless(); } pte_same_norecency() { return pte_same(); } ptep_set_access_flags(ptep, pte) { - may assume access/dirty in pte are "more permissive" than access/dirty in *ptep - if no HW access/dirty updates, "*ptep = pte" always results in "more permissive" change } An arch either specializes all 3 or none of them. This would allow us to get rid of ptep_get_lockless(). And it addresses the bug you found with ptep_set_access_flags(). BUT, I still have a nagging feeling that there are likely to be other similar problems caused by ignoring access/dirty during pte_same_norecency(). I can't convince myself that its definitely all safe and robust. So I'm leaning towards dropping patch 3 and therefore keeping ptep_get_lockless() around. Let me know if you have any insight that might help me change my mind :) Thanks, Ryan
On 03.04.24 14:59, Ryan Roberts wrote: > On 27/03/2024 09:34, David Hildenbrand wrote: >> On 26.03.24 18:51, Ryan Roberts wrote: >>> On 26/03/2024 17:39, David Hildenbrand wrote: >>>> On 26.03.24 18:32, Ryan Roberts wrote: >>>>> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>>>>> >>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the >>>>>>>>>> pte_same() >>>>>>>>>> handling. >>>>>>>>> >>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>>>>> we do >>>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an >>>>>>>>> ideal >>>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and >>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>>>>> >>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect >>>>>>>> any >>>>>>>> architecture. >>>>>>>> >>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture >>>>>>>> and the >>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>>>>> architectures. >>>>>>> >>>>>>> Wouldn't that break it's semantics? The "norecency" of >>>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte >>>>>>> may >>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the >>>>>>> access and dirty bits when you do the comparison". >>>>>> >>>>>> My idea was that ptep_get_lockless_norecency() would return the actual >>>>>> result on >>>>>> these architectures. So e.g., on x86, there would be no actual change in >>>>>> generated code. >>>>> >>>>> I think this is a bad plan... You'll end up with subtle differences between >>>>> architectures. >>>>> >>>>>> >>>>>> But yes, the documentation of these functions would have to be improved. >>>>>> >>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>>>>> dirty/accessed bits to more easily find any actual issues where the bits still >>>>>> matter ... >>>>> >>>>> I did a version that took that approach. Decided it was not as good as this way >>>>> though. Now for the life of me, I can't remember my reasoning. >>>> >>>> Maybe because there are some code paths that check accessed/dirty without >>>> "correctness" implications? For example, if the PTE is already dirty, no need to >>>> set it dirty etc? >>> >>> I think I decided I was penalizing the architectures that don't care because all >>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly >>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so >>> that I could feed its result into pte_same(). >> >> True. With ptep_get_norecency() you're also penalizing other architectures. >> Therefore my original thought about making the behavior arch-specific, but the >> arch has to make sure to get the combination of >> ptep_get_lockless_norecency()+ptep_same_norecency() is right. >> >> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must >> make sure to also ignore them in ptep_same_norecency(), and must be able to >> handle access/dirty bit changes differently. >> >> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed >> accessed or dirty". Only the former would end up ignoring stuff here, the latter >> would not. >> >> But again, just some random thoughts how this affects other architectures and >> how we could avoid it. The issue I describe in patch #3 would be gone if >> ptep_same_norecency() would just do a ptep_same() check on other architectures >> -- and would make it easier to sell :) >> > > I've been thinking some more about this. I think your proposal is the following: > > > // ARM64 > ptep_get_lockless_norecency() > { > - returned access/dirty may be incorrect > - returned access/dirty may be differently incorrect between 2 calls > } > pte_same_norecency() > { > - ignore access/dirty when doing comparison > } > ptep_set_access_flags(ptep, pte) > { > - must not assume access/dirty in pte are "more permissive" than > access/dirty in *ptep > - must only set access/dirty in *ptep, never clear > } > > > // Other arches: no change to generated code > ptep_get_lockless_norecency() > { > return ptep_get_lockless(); > } > pte_same_norecency() > { > return pte_same(); > } > ptep_set_access_flags(ptep, pte) > { > - may assume access/dirty in pte are "more permissive" than access/dirty > in *ptep > - if no HW access/dirty updates, "*ptep = pte" always results in "more > permissive" change > } > > An arch either specializes all 3 or none of them. > > This would allow us to get rid of ptep_get_lockless(). > > And it addresses the bug you found with ptep_set_access_flags(). > > > BUT, I still have a nagging feeling that there are likely to be other similar > problems caused by ignoring access/dirty during pte_same_norecency(). I can't > convince myself that its definitely all safe and robust. Right, we'd have to identify the other possible cases and document what an arch + common code must stick to to make it work. Some rules would be: if an arch implements ptep_get_lockless_norecency(): (1) Passing the result from ptep_get_lockless_norecency() to pte_same() is wrong. (2) Checking pte_young()/pte_old/pte_dirty()/pte_clean() after ptep_get_lockless_norecency() is very likely wrong. > > So I'm leaning towards dropping patch 3 and therefore keeping > ptep_get_lockless() around. > > Let me know if you have any insight that might help me change my mind :) I'm wondering if it would help if we could find a better name (or concept) for "norecency" here, that expresses that only on some archs we'd have that fuzzy handling. Keeping ptep_get_lockless() around for now sounds like the best alternative.
On 08/04/2024 09:36, David Hildenbrand wrote: > On 03.04.24 14:59, Ryan Roberts wrote: >> On 27/03/2024 09:34, David Hildenbrand wrote: >>> On 26.03.24 18:51, Ryan Roberts wrote: >>>> On 26/03/2024 17:39, David Hildenbrand wrote: >>>>> On 26.03.24 18:32, Ryan Roberts wrote: >>>>>> On 26/03/2024 17:04, David Hildenbrand wrote: >>>>>>>>>>> >>>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the >>>>>>>>>>> pte_same() >>>>>>>>>>> handling. >>>>>>>>>> >>>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including >>>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if >>>>>>>>>> we do >>>>>>>>>> that, we still have to keep pte_get_lockless() around for this case. >>>>>>>>>> In an >>>>>>>>>> ideal >>>>>>>>>> world we would convert everything over to >>>>>>>>>> ptep_get_lockless_norecency() and >>>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64. >>>>>>>>> >>>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really >>>>>>>>> affect >>>>>>>>> any >>>>>>>>> architecture. >>>>>>>>> >>>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture >>>>>>>>> and the >>>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other >>>>>>>>> architectures. >>>>>>>> >>>>>>>> Wouldn't that break it's semantics? The "norecency" of >>>>>>>> ptep_get_lockless_norecency() means "recency information in the returned >>>>>>>> pte >>>>>>>> may >>>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore >>>>>>>> the >>>>>>>> access and dirty bits when you do the comparison". >>>>>>> >>>>>>> My idea was that ptep_get_lockless_norecency() would return the actual >>>>>>> result on >>>>>>> these architectures. So e.g., on x86, there would be no actual change in >>>>>>> generated code. >>>>>> >>>>>> I think this is a bad plan... You'll end up with subtle differences between >>>>>> architectures. >>>>>> >>>>>>> >>>>>>> But yes, the documentation of these functions would have to be improved. >>>>>>> >>>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear >>>>>>> dirty/accessed bits to more easily find any actual issues where the bits >>>>>>> still >>>>>>> matter ... >>>>>> >>>>>> I did a version that took that approach. Decided it was not as good as >>>>>> this way >>>>>> though. Now for the life of me, I can't remember my reasoning. >>>>> >>>>> Maybe because there are some code paths that check accessed/dirty without >>>>> "correctness" implications? For example, if the PTE is already dirty, no >>>>> need to >>>>> set it dirty etc? >>>> >>>> I think I decided I was penalizing the architectures that don't care because >>>> all >>>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly >>>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so >>>> that I could feed its result into pte_same(). >>> >>> True. With ptep_get_norecency() you're also penalizing other architectures. >>> Therefore my original thought about making the behavior arch-specific, but the >>> arch has to make sure to get the combination of >>> ptep_get_lockless_norecency()+ptep_same_norecency() is right. >>> >>> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must >>> make sure to also ignore them in ptep_same_norecency(), and must be able to >>> handle access/dirty bit changes differently. >>> >>> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed >>> accessed or dirty". Only the former would end up ignoring stuff here, the latter >>> would not. >>> >>> But again, just some random thoughts how this affects other architectures and >>> how we could avoid it. The issue I describe in patch #3 would be gone if >>> ptep_same_norecency() would just do a ptep_same() check on other architectures >>> -- and would make it easier to sell :) >>> >> >> I've been thinking some more about this. I think your proposal is the following: >> >> >> // ARM64 >> ptep_get_lockless_norecency() >> { >> - returned access/dirty may be incorrect >> - returned access/dirty may be differently incorrect between 2 calls >> } >> pte_same_norecency() >> { >> - ignore access/dirty when doing comparison >> } >> ptep_set_access_flags(ptep, pte) >> { >> - must not assume access/dirty in pte are "more permissive" than >> access/dirty in *ptep >> - must only set access/dirty in *ptep, never clear >> } >> >> >> // Other arches: no change to generated code >> ptep_get_lockless_norecency() >> { >> return ptep_get_lockless(); >> } >> pte_same_norecency() >> { >> return pte_same(); >> } >> ptep_set_access_flags(ptep, pte) >> { >> - may assume access/dirty in pte are "more permissive" than access/dirty >> in *ptep >> - if no HW access/dirty updates, "*ptep = pte" always results in "more >> permissive" change >> } >> >> An arch either specializes all 3 or none of them. >> >> This would allow us to get rid of ptep_get_lockless(). >> >> And it addresses the bug you found with ptep_set_access_flags(). >> >> >> BUT, I still have a nagging feeling that there are likely to be other similar >> problems caused by ignoring access/dirty during pte_same_norecency(). I can't >> convince myself that its definitely all safe and robust. > > Right, we'd have to identify the other possible cases and document what an arch > + common code must stick to to make it work. > > Some rules would be: if an arch implements ptep_get_lockless_norecency(): > > (1) Passing the result from ptep_get_lockless_norecency() to pte_same() > is wrong. > (2) Checking pte_young()/pte_old/pte_dirty()/pte_clean() after > ptep_get_lockless_norecency() is very likely wrong. > > >> >> So I'm leaning towards dropping patch 3 and therefore keeping >> ptep_get_lockless() around. >> >> Let me know if you have any insight that might help me change my mind :) > > I'm wondering if it would help if we could find a better name (or concept) for > "norecency" here, that expresses that only on some archs we'd have that fuzzy > handling. > > Keeping ptep_get_lockless() around for now sounds like the best alternative. Separately to this I've been playing with an idea to add support for uffd-wp and soft-dirty SW PTE bits for arm64; it boils down to keeping the SW bits in separate storage, linked from the ptdesc. And we have some constant HW PTE bits that we can remove and replace with those SW bits so we can keep the pte_t the same size and abstract it all with ptep_get() and set_ptes(). It was all looking straightforward until I got to ptep_get_lockless(). Now that there are 2 separate locations for PTE bits, I can't read it all atomically. So I've been looking at all this again, and getting myself even more confused. I believe there are 2 classes of ptep_get_lockless() caller: 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault() 2) everyone else -- (1) doesn't really care if orig_pte is consistent or not. It just wants to read a value, do some speculative work based on that value, then lock the PTL, and check the value hasn't changed. If it has changed, it backs out. So we don't actually need any "lockless" guarrantees here; we could just use ptep_get(). In fact, prior to Hugh's commit 26e1a0c3277d ("mm: use pmdp_get_lockless() without surplus barrier()"), we had this: vmf->pte = pte_offset_map(vmf->pmd, vmf->address); - vmf->orig_pte = *vmf->pte; + vmf->orig_pte = ptep_get_lockless(vmf->pte); vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; - /* - * some architectures can have larger ptes than wordsize, - * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and - * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic - * accesses. The code below just needs a consistent view - * for the ifs and we later double check anyway with the - * ptl lock held. So here a barrier will do. - */ - barrier(); if (pte_none(vmf->orig_pte)) { -- (2) All the other users require that a subset of the pte fields are self-consistent; specifically they don't care about access, dirty, uffd-wp or soft-dirty. arm64 can guarrantee that all the other bits are self-consistent just by calling ptep_get(). -- So, I'm making the bold claim that it was never neccessary to specialize pte_get_lockless() on arm64, and I *think* we could just delete it so that ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original aim without needing to introduce "norecency" variants. Additionally I propose documenting ptep_get_lockless() to describe the set of fields that are guarranteed to be self-consistent and the remaining fields which are self-consistent only with best-effort. Could it be this easy? My head is hurting... Thanks, Ryan
[...] Skipping the ptdesc stuff we discussed offline, to not get distracted. :) This stuff is killing me, sorry for the lengthy reply ... > > So I've been looking at all this again, and getting myself even more confused. > > I believe there are 2 classes of ptep_get_lockless() caller: > > 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault() > 2) everyone else Likely only completely lockless page table walkers where we *cannot* recheck under PTL is special. Essentially where we disable interrupts and rely on TLB flushes to sync against concurrent changes. Let's take a look where ptep_get_lockless() comes from: commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3 Author: Peter Zijlstra <peterz@infradead.org> Date: Fri Nov 13 11:41:40 2020 +0100 mm/gup: Provide gup_get_pte() more generic In order to write another lockless page-table walker, we need gup_get_pte() exposed. While doing that, rename it to ptep_get_lockless() to match the existing ptep_get() naming. GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast. "With get_user_pages_fast(), we walk down the pagetables without taking any locks. For this we would like to load the pointers atomically, but sometimes that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What we do have is the guarantee that a PTE will only either go from not present to present, or present to not present or both -- it will not switch to a completely different present page without a TLB flush in between; something hat we are blocking by holding interrupts off." Later, we added support for GUP-fast that introduced the !TLB variant: commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13 Author: Steve Capper <steve.capper@linaro.org> Date: Thu Oct 9 15:29:14 2014 -0700 mm: introduce a general RCU get_user_pages_fast() With the pattern /* * In the line below we are assuming that the pte can be read * atomically. If this is not the case for your architecture, * please wrap this in a helper function! * * for an example see gup_get_pte in arch/x86/mm/gup.c */ pte_t pte = ACCESS_ONCE(*ptep); ... if (unlikely(pte_val(pte) != pte_val(*ptep))) { ... Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte = gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was required to be sane, this the lengthy comment above ptep_get_lockless() that talks about TLB flushes. The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still full of details about TLB flushes syncing against GUP-fast. But as you note, we use it even in contexts where we don't disable interrupts and the TLB flush can't help. We don't disable interrupts during page faults ... so most of the things described in ptep_get_lockless() don't really apply. That's also the reason why ... > > vmf->pte = pte_offset_map(vmf->pmd, vmf->address); > - vmf->orig_pte = *vmf->pte; > + vmf->orig_pte = ptep_get_lockless(vmf->pte); > vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; > > - /* > - * some architectures can have larger ptes than wordsize, > - * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and > - * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic > - * accesses. The code below just needs a consistent view > - * for the ifs and we later double check anyway with the > - * ptl lock held. So here a barrier will do. > - */ > - barrier(); > if (pte_none(vmf->orig_pte)) { ... that code was in place. We would possibly read garbage PTEs, but would recheck *under PTL* (where the PTE can no longer change) that what we read wasn't garbage and didn't change. > > -- > > (2) All the other users require that a subset of the pte fields are > self-consistent; specifically they don't care about access, dirty, uffd-wp or > soft-dirty. arm64 can guarrantee that all the other bits are self-consistent > just by calling ptep_get(). Let's focus on access+dirty for now ;) > > -- > > So, I'm making the bold claim that it was never neccessary to specialize > pte_get_lockless() on arm64, and I *think* we could just delete it so that > ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original > aim without needing to introduce "norecency" variants. > > Additionally I propose documenting ptep_get_lockless() to describe the set of > fields that are guarranteed to be self-consistent and the remaining fields which > are self-consistent only with best-effort. > > Could it be this easy? My head is hurting... I think what has to happen is: (1) pte_get_lockless() must return the same value as ptep_get() as long as there are no races. No removal/addition of access/dirty bits etc. (2) Lockless page table walkers that later verify under the PTL can handle serious "garbage PTEs". This is our page fault handler. (3) Lockless page table walkers that cannot verify under PTL cannot handle arbitrary garbage PTEs. This is GUP-fast. Two options: (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes required. This is the common case. HW might concurrently set access/dirty bits, so we can race with that. But we don't read garbage. (3b) pte_get_lockless() cannot atomically read the PTE: We need special magic to read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious PTE changes (e.g., present -> present). This is weird x86-PAE. If ptep_get() on arm64 can do (1), (2) and (3a), we might be good. My head is hurting ...
On 10/04/2024 21:09, David Hildenbrand wrote: > [...] > > Skipping the ptdesc stuff we discussed offline, to not get distracted. :) > > This stuff is killing me, sorry for the lengthy reply ... No problem - thanks for taking the time to think it through and reply with such clarity. :) > >> >> So I've been looking at all this again, and getting myself even more confused. >> >> I believe there are 2 classes of ptep_get_lockless() caller: >> >> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault() >> 2) everyone else > > Likely only completely lockless page table walkers where we *cannot* recheck > under PTL is special. Essentially where we disable interrupts and rely on TLB > flushes to sync against concurrent changes. Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and "lockless walkers that never take the PTL". Detail: the part about disabling interrupts and TLB flush syncing is arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But you make that clear further down. > > Let's take a look where ptep_get_lockless() comes from: > > commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3 > Author: Peter Zijlstra <peterz@infradead.org> > Date: Fri Nov 13 11:41:40 2020 +0100 > > mm/gup: Provide gup_get_pte() more generic > > In order to write another lockless page-table walker, we need > gup_get_pte() exposed. While doing that, rename it to > ptep_get_lockless() to match the existing ptep_get() naming. > > > GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast. > > "With get_user_pages_fast(), we walk down the pagetables without taking any > locks. For this we would like to load the pointers atomically, but sometimes > that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What we > do have is the guarantee that a PTE will only either go from not present to > present, or present to not present or both -- it will not switch to a completely > different present page without a TLB flush in between; something hat we are > blocking by holding interrupts off." > > Later, we added support for GUP-fast that introduced the !TLB variant: > > commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13 > Author: Steve Capper <steve.capper@linaro.org> > Date: Thu Oct 9 15:29:14 2014 -0700 > > mm: introduce a general RCU get_user_pages_fast() > > With the pattern > > /* > * In the line below we are assuming that the pte can be read > * atomically. If this is not the case for your architecture, > * please wrap this in a helper function! > * > * for an example see gup_get_pte in arch/x86/mm/gup.c > */ > pte_t pte = ACCESS_ONCE(*ptep); > ... > if (unlikely(pte_val(pte) != pte_val(*ptep))) { > ... > > > Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte = > gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was > required to be sane, this the lengthy comment above ptep_get_lockless() that > talks about TLB flushes. > > The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still > full of details about TLB flushes syncing against GUP-fast. But as you note, we > use it even in contexts where we don't disable interrupts and the TLB flush > can't help. > > We don't disable interrupts during page faults ... so most of the things > described in ptep_get_lockless() don't really apply. > > That's also the reason why ... >> >> vmf->pte = pte_offset_map(vmf->pmd, vmf->address); >> - vmf->orig_pte = *vmf->pte; >> + vmf->orig_pte = ptep_get_lockless(vmf->pte); >> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; >> >> - /* >> - * some architectures can have larger ptes than wordsize, >> - * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and >> - * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic >> - * accesses. The code below just needs a consistent view >> - * for the ifs and we later double check anyway with the >> - * ptl lock held. So here a barrier will do. >> - */ >> - barrier(); >> if (pte_none(vmf->orig_pte)) { > > ... that code was in place. We would possibly read garbage PTEs, but would > recheck *under PTL* (where the PTE can no longer change) that what we read > wasn't garbage and didn't change. Agreed. > >> >> -- >> >> (2) All the other users require that a subset of the pte fields are >> self-consistent; specifically they don't care about access, dirty, uffd-wp or >> soft-dirty. arm64 can guarrantee that all the other bits are self-consistent >> just by calling ptep_get(). > > Let's focus on access+dirty for now ;) > >> >> -- >> >> So, I'm making the bold claim that it was never neccessary to specialize >> pte_get_lockless() on arm64, and I *think* we could just delete it so that >> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original >> aim without needing to introduce "norecency" variants. >> >> Additionally I propose documenting ptep_get_lockless() to describe the set of >> fields that are guarranteed to be self-consistent and the remaining fields which >> are self-consistent only with best-effort. >> >> Could it be this easy? My head is hurting... > > I think what has to happen is: > > (1) pte_get_lockless() must return the same value as ptep_get() as long as there > are no races. No removal/addition of access/dirty bits etc. Today's arm64 ptep_get() guarantees this. > > (2) Lockless page table walkers that later verify under the PTL can handle > serious "garbage PTEs". This is our page fault handler. This isn't really a property of a ptep_get_lockless(); its a statement about a class of users. I agree with the statement. > > (3) Lockless page table walkers that cannot verify under PTL cannot handle > arbitrary garbage PTEs. This is GUP-fast. Two options: > > (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the > atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes > required. This is the common case. HW might concurrently set access/dirty bits, > so we can race with that. But we don't read garbage. Today's arm64 ptep_get() cannot garantee that the access/dirty bits are consistent for contpte ptes. That's the bit that complicates the current ptep_get_lockless() implementation. But the point I was trying to make is that GUP-fast does not actually care about *all* the fields being consistent (e.g. access/dirty). So we could spec pte_get_lockless() to say that "all fields in the returned pte are guarranteed to be self-consistent except for access and dirty information, which may be inconsistent if a racing modification occured". This could mean that the access/dirty state *does* change for a given page while GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think* that failing to detect this is benign. Aside: GUP-fast currently rechecks the pte originally obtained with ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform to (1), so either it returns the same pte or it returns a different pte or garbage. But that garbage could just happen to be the same as the originally obtained pte. So in that case, it would have a false match. I think this needs to be changed to ptep_get_lockless()? > > (3b) pte_get_lockless() cannot atomically read the PTE: We need special magic to > read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious > PTE changes (e.g., present -> present). This is weird x86-PAE. > > > If ptep_get() on arm64 can do (1), (2) and (3a), we might be good. > > My head is hurting ... >
On 12/04/2024 21:16, David Hildenbrand wrote: >> >> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and >> "lockless walkers that never take the PTL". >> >> Detail: the part about disabling interrupts and TLB flush syncing is >> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But >> you make that clear further down. > > Yes, but disabling interrupts is also required for RCU-freeing of page tables > such that they can be walked safely. The TLB flush IPI is arch-specific and > indeed to sync against PTE invalidation (before generic GUP-fast). > [...] > >>>> >>>> Could it be this easy? My head is hurting... >>> >>> I think what has to happen is: >>> >>> (1) pte_get_lockless() must return the same value as ptep_get() as long as there >>> are no races. No removal/addition of access/dirty bits etc. >> >> Today's arm64 ptep_get() guarantees this. >> >>> >>> (2) Lockless page table walkers that later verify under the PTL can handle >>> serious "garbage PTEs". This is our page fault handler. >> >> This isn't really a property of a ptep_get_lockless(); its a statement about a >> class of users. I agree with the statement. > > Yes. That's a requirement for the user of ptep_get_lockless(), such as page > fault handlers. Well, mostly "not GUP". > >> >>> >>> (3) Lockless page table walkers that cannot verify under PTL cannot handle >>> arbitrary garbage PTEs. This is GUP-fast. Two options: >>> >>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the >>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes >>> required. This is the common case. HW might concurrently set access/dirty bits, >>> so we can race with that. But we don't read garbage. >> >> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are >> consistent for contpte ptes. That's the bit that complicates the current >> ptep_get_lockless() implementation. >> >> But the point I was trying to make is that GUP-fast does not actually care about >> *all* the fields being consistent (e.g. access/dirty). So we could spec >> pte_get_lockless() to say that "all fields in the returned pte are guarranteed >> to be self-consistent except for access and dirty information, which may be >> inconsistent if a racing modification occured". > > We *might* have KVM in the future want to check that a PTE is dirty, such that > we can only allow dirty PTEs to be writable in a secondary MMU. That's not there > yet, but one thing I was discussing on the list recently. Burried in: > > https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com > > We wouldn't care about racing modifications, as long as MMU notifiers will > properly notify us when the PTE would lose its dirty bits. > > But getting false-positive dirty bits would be problematic. > >> >> This could mean that the access/dirty state *does* change for a given page while >> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think* >> that failing to detect this is benign. > > I mean, HW could just set the dirty/access bit immediately after the check. So > if HW concurrently sets the bit and we don't observe that change when we > recheck, I think that would be perfectly fine. Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or soft-dirty or uffd-wp). But if you don't want to change the ptep_get_lockless() spec to explicitly allow this (because you have the KVM use case where false-positive dirty is problematic), then I think we are stuck with ptep_get_lockless() as implemented for arm64 today. > >> >> Aside: GUP-fast currently rechecks the pte originally obtained with >> ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform >> to (1), so either it returns the same pte or it returns a different pte or >> garbage. But that garbage could just happen to be the same as the originally >> obtained pte. So in that case, it would have a false match. I think this needs >> to be changed to ptep_get_lockless()? > > I *think* it's fine, because the case where it would make a difference (x86-PAE) > still requires the TLB flush IPI to sync against PTE changes, and that check > would likely be wrong in one way or the other. So for x86-pae, that check is > just moot either way. > > That my theory, at least. > > (but this "let's fake-read atomically although we don't, but let's do like we > could in some specific circumstances" is really hard to get) > > I was wondering a while ago if we are missing a memory barrier before the checl, > but I think the one from obtaining the page reference gets the job done (at > least that's what I remember). >
On 15/04/2024 11:57, David Hildenbrand wrote: > On 15.04.24 11:28, Ryan Roberts wrote: >> On 12/04/2024 21:16, David Hildenbrand wrote: >>>> >>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and >>>> "lockless walkers that never take the PTL". >>>> >>>> Detail: the part about disabling interrupts and TLB flush syncing is >>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But >>>> you make that clear further down. >>> >>> Yes, but disabling interrupts is also required for RCU-freeing of page tables >>> such that they can be walked safely. The TLB flush IPI is arch-specific and >>> indeed to sync against PTE invalidation (before generic GUP-fast). >>> [...] >>> >>>>>> >>>>>> Could it be this easy? My head is hurting... >>>>> >>>>> I think what has to happen is: >>>>> >>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as >>>>> there >>>>> are no races. No removal/addition of access/dirty bits etc. >>>> >>>> Today's arm64 ptep_get() guarantees this. >>>> >>>>> >>>>> (2) Lockless page table walkers that later verify under the PTL can handle >>>>> serious "garbage PTEs". This is our page fault handler. >>>> >>>> This isn't really a property of a ptep_get_lockless(); its a statement about a >>>> class of users. I agree with the statement. >>> >>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page >>> fault handlers. Well, mostly "not GUP". >>> >>>> >>>>> >>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle >>>>> arbitrary garbage PTEs. This is GUP-fast. Two options: >>>>> >>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the >>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes >>>>> required. This is the common case. HW might concurrently set access/dirty >>>>> bits, >>>>> so we can race with that. But we don't read garbage. >>>> >>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are >>>> consistent for contpte ptes. That's the bit that complicates the current >>>> ptep_get_lockless() implementation. >>>> >>>> But the point I was trying to make is that GUP-fast does not actually care >>>> about >>>> *all* the fields being consistent (e.g. access/dirty). So we could spec >>>> pte_get_lockless() to say that "all fields in the returned pte are guarranteed >>>> to be self-consistent except for access and dirty information, which may be >>>> inconsistent if a racing modification occured". >>> >>> We *might* have KVM in the future want to check that a PTE is dirty, such that >>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not there >>> yet, but one thing I was discussing on the list recently. Burried in: >>> >>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com >>> >>> We wouldn't care about racing modifications, as long as MMU notifiers will >>> properly notify us when the PTE would lose its dirty bits. >>> >>> But getting false-positive dirty bits would be problematic. >>> >>>> >>>> This could mean that the access/dirty state *does* change for a given page >>>> while >>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think* >>>> that failing to detect this is benign. >>> >>> I mean, HW could just set the dirty/access bit immediately after the check. So >>> if HW concurrently sets the bit and we don't observe that change when we >>> recheck, I think that would be perfectly fine. >> >> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or >> soft-dirty or uffd-wp). >> >> But if you don't want to change the ptep_get_lockless() spec to explicitly allow >> this (because you have the KVM use case where false-positive dirty is >> problematic), then I think we are stuck with ptep_get_lockless() as implemented >> for arm64 today. > > At least regarding the dirty bit, we'd have to guarantee that if > ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck > would be able to catch that. > > Would that be possible? Hmm maybe. My head hurts. Let me try to work through some examples... Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1, 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is stored in PTE0 for the contpte block, and it is dirty. Now let's say there are 2 racing threads: - ThreadA is doing a GUP-fast for PTE3 - ThreadB is remapping order-0 FolioB at PTE0 (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the example - today's arm64 ptep_get_lockless() can handle the below correctly). ThreadA ThreadB ======= ======= gup_pte_range() pte1 = ptep_get_lockless(PTE3) READ_ONCE(PTE3) mmap(PTE0) clear_pte(PTE0) unfold(PTE0 - PTE3) WRITE_ONCE(PTE0, 0) WRITE_ONCE(PTE1, 0) WRITE_ONCE(PTE2, 0) READ_ONCE(PTE0) (for a/d) << CLEAN!! READ_ONCE(PTE1) (for a/d) READ_ONCE(PTE2) (for a/d) READ_ONCE(PTE3) (for a/d) <do speculative work with pte1 content> pte2 = ptep_get_lockless(PTE3) READ_ONCE(PTE3) READ_ONCE(PTE0) (for a/d) READ_ONCE(PTE1) (for a/d) READ_ONCE(PTE2) (for a/d) READ_ONCE(PTE3) (for a/d) true = pte_same(pte1, pte2) WRITE_ONCE(PTE3, 0) TLBI WRITE_ONCE(PTE0, <orig & ~CONT>) WRITE_ONCE(PTE1, <orig & ~CONT>) WRITE_ONCE(PTE2, <orig & ~CONT>) WRITE_ONCE(PTE3, <orig & ~CONT>) WRITE_ONCE(PTE0, 0) set_pte_at(PTE0, <new>) This example shows how a *false-negative* can be returned for the dirty state, which isn't detected by the check. I've been unable to come up with an example where a *false-positive* can be returned for dirty state without the second ptep_get_lockless() noticing. In this second example, let's assume everything is the same execpt FolioA is initially clean: ThreadA ThreadB ======= ======= gup_pte_range() pte1 = ptep_get_lockless(PTE3) READ_ONCE(PTE3) mmap(PTE0) clear_pte(PTE0) unfold(PTE0 - PTE3) WRITE_ONCE(PTE0, 0) WRITE_ONCE(PTE1, 0) WRITE_ONCE(PTE2, 0) WRITE_ONCE(PTE3, 0) TLBI WRITE_ONCE(PTE0, <orig & ~CONT>) WRITE_ONCE(PTE1, <orig & ~CONT>) WRITE_ONCE(PTE2, <orig & ~CONT>) WRITE_ONCE(PTE3, <orig & ~CONT>) WRITE_ONCE(PTE0, 0) set_pte_at(PTE0, <new>) write to FolioB - HW sets PTE0's dirty READ_ONCE(PTE0) (for a/d) << DIRTY!! READ_ONCE(PTE1) (for a/d) READ_ONCE(PTE2) (for a/d) READ_ONCE(PTE3) (for a/d) <do speculative work with pte1 content> pte2 = ptep_get_lockless(PTE3) READ_ONCE(PTE3) << BUT THIS IS FOR FolioB READ_ONCE(PTE0) (for a/d) READ_ONCE(PTE1) (for a/d) READ_ONCE(PTE2) (for a/d) READ_ONCE(PTE3) (for a/d) false = pte_same(pte1, pte2) << So this fails The only way I can see false-positive not being caught in the second example is if ThreadB subseuently remaps the original folio, so you have an ABA scenario. But these lockless table walkers are already suseptible to that. I think all the same arguments can be extended to the access bit. For me this is all getting rather subtle and difficult to reason about and even harder to spec in a comprehensible way. The best I could come up with is: "All fields in the returned pte are guarranteed to be self-consistent except for access and dirty information, which may be inconsistent if a racing modification occured. Additionally it is guranteed that false-positive access and/or dirty information is not possible if 2 calls are made and both ptes are the same. Only false-negative access and/or dirty information is possible in this scenario." which is starting to sound bonkers. Personally I think we are better off at this point, just keeping today's arm64 ptep_get_lockless(). Thanks, Ryan
On 15/04/2024 15:23, David Hildenbrand wrote: > On 15.04.24 15:30, Ryan Roberts wrote: >> On 15/04/2024 11:57, David Hildenbrand wrote: >>> On 15.04.24 11:28, Ryan Roberts wrote: >>>> On 12/04/2024 21:16, David Hildenbrand wrote: >>>>>> >>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and >>>>>> "lockless walkers that never take the PTL". >>>>>> >>>>>> Detail: the part about disabling interrupts and TLB flush syncing is >>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But >>>>>> you make that clear further down. >>>>> >>>>> Yes, but disabling interrupts is also required for RCU-freeing of page tables >>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and >>>>> indeed to sync against PTE invalidation (before generic GUP-fast). >>>>> [...] >>>>> >>>>>>>> >>>>>>>> Could it be this easy? My head is hurting... >>>>>>> >>>>>>> I think what has to happen is: >>>>>>> >>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as >>>>>>> there >>>>>>> are no races. No removal/addition of access/dirty bits etc. >>>>>> >>>>>> Today's arm64 ptep_get() guarantees this. >>>>>> >>>>>>> >>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle >>>>>>> serious "garbage PTEs". This is our page fault handler. >>>>>> >>>>>> This isn't really a property of a ptep_get_lockless(); its a statement >>>>>> about a >>>>>> class of users. I agree with the statement. >>>>> >>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page >>>>> fault handlers. Well, mostly "not GUP". >>>>> >>>>>> >>>>>>> >>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle >>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options: >>>>>>> >>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if >>>>>>> the >>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes >>>>>>> required. This is the common case. HW might concurrently set access/dirty >>>>>>> bits, >>>>>>> so we can race with that. But we don't read garbage. >>>>>> >>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are >>>>>> consistent for contpte ptes. That's the bit that complicates the current >>>>>> ptep_get_lockless() implementation. >>>>>> >>>>>> But the point I was trying to make is that GUP-fast does not actually care >>>>>> about >>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec >>>>>> pte_get_lockless() to say that "all fields in the returned pte are >>>>>> guarranteed >>>>>> to be self-consistent except for access and dirty information, which may be >>>>>> inconsistent if a racing modification occured". >>>>> >>>>> We *might* have KVM in the future want to check that a PTE is dirty, such that >>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not >>>>> there >>>>> yet, but one thing I was discussing on the list recently. Burried in: >>>>> >>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com >>>>> >>>>> We wouldn't care about racing modifications, as long as MMU notifiers will >>>>> properly notify us when the PTE would lose its dirty bits. >>>>> >>>>> But getting false-positive dirty bits would be problematic. >>>>> >>>>>> >>>>>> This could mean that the access/dirty state *does* change for a given page >>>>>> while >>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think* >>>>>> that failing to detect this is benign. >>>>> >>>>> I mean, HW could just set the dirty/access bit immediately after the check. So >>>>> if HW concurrently sets the bit and we don't observe that change when we >>>>> recheck, I think that would be perfectly fine. >>>> >>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or >>>> soft-dirty or uffd-wp). >>>> >>>> But if you don't want to change the ptep_get_lockless() spec to explicitly >>>> allow >>>> this (because you have the KVM use case where false-positive dirty is >>>> problematic), then I think we are stuck with ptep_get_lockless() as implemented >>>> for arm64 today. >>> >>> At least regarding the dirty bit, we'd have to guarantee that if >>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck >>> would be able to catch that. >>> >>> Would that be possible? >> >> Hmm maybe. My head hurts. Let me try to work through some examples... >> >> >> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1, >> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is stored in >> PTE0 for the contpte block, and it is dirty. >> >> Now let's say there are 2 racing threads: >> >> - ThreadA is doing a GUP-fast for PTE3 >> - ThreadB is remapping order-0 FolioB at PTE0 >> >> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the >> example - today's arm64 ptep_get_lockless() can handle the below correctly). >> >> ThreadA ThreadB >> ======= ======= >> >> gup_pte_range() >> pte1 = ptep_get_lockless(PTE3) >> READ_ONCE(PTE3) >> mmap(PTE0) >> clear_pte(PTE0) >> unfold(PTE0 - PTE3) >> WRITE_ONCE(PTE0, 0) >> WRITE_ONCE(PTE1, 0) >> WRITE_ONCE(PTE2, 0) >> READ_ONCE(PTE0) (for a/d) << CLEAN!! >> READ_ONCE(PTE1) (for a/d) >> READ_ONCE(PTE2) (for a/d) >> READ_ONCE(PTE3) (for a/d) >> <do speculative work with pte1 content> >> pte2 = ptep_get_lockless(PTE3) >> READ_ONCE(PTE3) >> READ_ONCE(PTE0) (for a/d) >> READ_ONCE(PTE1) (for a/d) >> READ_ONCE(PTE2) (for a/d) >> READ_ONCE(PTE3) (for a/d) >> true = pte_same(pte1, pte2) >> WRITE_ONCE(PTE3, 0) >> TLBI >> WRITE_ONCE(PTE0, <orig & ~CONT>) >> WRITE_ONCE(PTE1, <orig & ~CONT>) >> WRITE_ONCE(PTE2, <orig & ~CONT>) >> WRITE_ONCE(PTE3, <orig & ~CONT>) >> WRITE_ONCE(PTE0, 0) >> set_pte_at(PTE0, <new>) >> >> This example shows how a *false-negative* can be returned for the dirty state, >> which isn't detected by the check. >> >> I've been unable to come up with an example where a *false-positive* can be >> returned for dirty state without the second ptep_get_lockless() noticing. In >> this second example, let's assume everything is the same execpt FolioA is >> initially clean: >> >> ThreadA ThreadB >> ======= ======= >> >> gup_pte_range() >> pte1 = ptep_get_lockless(PTE3) >> READ_ONCE(PTE3) >> mmap(PTE0) >> clear_pte(PTE0) >> unfold(PTE0 - PTE3) >> WRITE_ONCE(PTE0, 0) >> WRITE_ONCE(PTE1, 0) >> WRITE_ONCE(PTE2, 0) >> WRITE_ONCE(PTE3, 0) >> TLBI >> WRITE_ONCE(PTE0, <orig & ~CONT>) >> WRITE_ONCE(PTE1, <orig & ~CONT>) >> WRITE_ONCE(PTE2, <orig & ~CONT>) >> WRITE_ONCE(PTE3, <orig & ~CONT>) >> WRITE_ONCE(PTE0, 0) >> set_pte_at(PTE0, <new>) >> write to FolioB - HW sets PTE0's dirty >> READ_ONCE(PTE0) (for a/d) << DIRTY!! >> READ_ONCE(PTE1) (for a/d) >> READ_ONCE(PTE2) (for a/d) >> READ_ONCE(PTE3) (for a/d) >> <do speculative work with pte1 content> >> pte2 = ptep_get_lockless(PTE3) >> READ_ONCE(PTE3) << BUT THIS IS FOR FolioB >> READ_ONCE(PTE0) (for a/d) >> READ_ONCE(PTE1) (for a/d) >> READ_ONCE(PTE2) (for a/d) >> READ_ONCE(PTE3) (for a/d) >> false = pte_same(pte1, pte2) << So this fails >> >> The only way I can see false-positive not being caught in the second example is >> if ThreadB subseuently remaps the original folio, so you have an ABA scenario. >> But these lockless table walkers are already suseptible to that. >> >> I think all the same arguments can be extended to the access bit. >> >> >> For me this is all getting rather subtle and difficult to reason about and even >> harder to spec in a comprehensible way. The best I could come up with is: >> >> "All fields in the returned pte are guarranteed to be self-consistent except for >> access and dirty information, which may be inconsistent if a racing modification >> occured. Additionally it is guranteed that false-positive access and/or dirty >> information is not possible if 2 calls are made and both ptes are the same. Only >> false-negative access and/or dirty information is possible in this scenario." >> >> which is starting to sound bonkers. Personally I think we are better off at this >> point, just keeping today's arm64 ptep_get_lockless(). > > Remind me again, does arm64 perform an IPI broadcast during a TLB flush that > would sync against GUP-fast? No, the broadcast is in HW. There is no IPI.
On 15/04/2024 15:58, David Hildenbrand wrote: > On 15.04.24 16:34, Ryan Roberts wrote: >> On 15/04/2024 15:23, David Hildenbrand wrote: >>> On 15.04.24 15:30, Ryan Roberts wrote: >>>> On 15/04/2024 11:57, David Hildenbrand wrote: >>>>> On 15.04.24 11:28, Ryan Roberts wrote: >>>>>> On 12/04/2024 21:16, David Hildenbrand wrote: >>>>>>>> >>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and >>>>>>>> "lockless walkers that never take the PTL". >>>>>>>> >>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is >>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the >>>>>>>> TLBIs). But >>>>>>>> you make that clear further down. >>>>>>> >>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page >>>>>>> tables >>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and >>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast). >>>>>>> [...] >>>>>>> >>>>>>>>>> >>>>>>>>>> Could it be this easy? My head is hurting... >>>>>>>>> >>>>>>>>> I think what has to happen is: >>>>>>>>> >>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as >>>>>>>>> there >>>>>>>>> are no races. No removal/addition of access/dirty bits etc. >>>>>>>> >>>>>>>> Today's arm64 ptep_get() guarantees this. >>>>>>>> >>>>>>>>> >>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle >>>>>>>>> serious "garbage PTEs". This is our page fault handler. >>>>>>>> >>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement >>>>>>>> about a >>>>>>>> class of users. I agree with the statement. >>>>>>> >>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page >>>>>>> fault handlers. Well, mostly "not GUP". >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle >>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options: >>>>>>>>> >>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if >>>>>>>>> the >>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB >>>>>>>>> flushes >>>>>>>>> required. This is the common case. HW might concurrently set access/dirty >>>>>>>>> bits, >>>>>>>>> so we can race with that. But we don't read garbage. >>>>>>>> >>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are >>>>>>>> consistent for contpte ptes. That's the bit that complicates the current >>>>>>>> ptep_get_lockless() implementation. >>>>>>>> >>>>>>>> But the point I was trying to make is that GUP-fast does not actually care >>>>>>>> about >>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec >>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are >>>>>>>> guarranteed >>>>>>>> to be self-consistent except for access and dirty information, which may be >>>>>>>> inconsistent if a racing modification occured". >>>>>>> >>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such >>>>>>> that >>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not >>>>>>> there >>>>>>> yet, but one thing I was discussing on the list recently. Burried in: >>>>>>> >>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com >>>>>>> >>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will >>>>>>> properly notify us when the PTE would lose its dirty bits. >>>>>>> >>>>>>> But getting false-positive dirty bits would be problematic. >>>>>>> >>>>>>>> >>>>>>>> This could mean that the access/dirty state *does* change for a given page >>>>>>>> while >>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I >>>>>>>> *think* >>>>>>>> that failing to detect this is benign. >>>>>>> >>>>>>> I mean, HW could just set the dirty/access bit immediately after the >>>>>>> check. So >>>>>>> if HW concurrently sets the bit and we don't observe that change when we >>>>>>> recheck, I think that would be perfectly fine. >>>>>> >>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or >>>>>> soft-dirty or uffd-wp). >>>>>> >>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly >>>>>> allow >>>>>> this (because you have the KVM use case where false-positive dirty is >>>>>> problematic), then I think we are stuck with ptep_get_lockless() as >>>>>> implemented >>>>>> for arm64 today. >>>>> >>>>> At least regarding the dirty bit, we'd have to guarantee that if >>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck >>>>> would be able to catch that. >>>>> >>>>> Would that be possible? >>>> >>>> Hmm maybe. My head hurts. Let me try to work through some examples... >>>> >>>> >>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1, >>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is >>>> stored in >>>> PTE0 for the contpte block, and it is dirty. >>>> >>>> Now let's say there are 2 racing threads: >>>> >>>> - ThreadA is doing a GUP-fast for PTE3 >>>> - ThreadB is remapping order-0 FolioB at PTE0 >>>> >>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the >>>> example - today's arm64 ptep_get_lockless() can handle the below correctly). >>>> >>>> ThreadA ThreadB >>>> ======= ======= >>>> >>>> gup_pte_range() >>>> pte1 = ptep_get_lockless(PTE3) >>>> READ_ONCE(PTE3) >>>> mmap(PTE0) >>>> clear_pte(PTE0) >>>> unfold(PTE0 - PTE3) >>>> WRITE_ONCE(PTE0, 0) >>>> WRITE_ONCE(PTE1, 0) >>>> WRITE_ONCE(PTE2, 0) >>>> READ_ONCE(PTE0) (for a/d) << CLEAN!! >>>> READ_ONCE(PTE1) (for a/d) >>>> READ_ONCE(PTE2) (for a/d) >>>> READ_ONCE(PTE3) (for a/d) >>>> <do speculative work with pte1 content> >>>> pte2 = ptep_get_lockless(PTE3) >>>> READ_ONCE(PTE3) >>>> READ_ONCE(PTE0) (for a/d) >>>> READ_ONCE(PTE1) (for a/d) >>>> READ_ONCE(PTE2) (for a/d) >>>> READ_ONCE(PTE3) (for a/d) >>>> true = pte_same(pte1, pte2) >>>> WRITE_ONCE(PTE3, 0) >>>> TLBI >>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>> WRITE_ONCE(PTE0, 0) >>>> set_pte_at(PTE0, <new>) >>>> >>>> This example shows how a *false-negative* can be returned for the dirty state, >>>> which isn't detected by the check. >>>> >>>> I've been unable to come up with an example where a *false-positive* can be >>>> returned for dirty state without the second ptep_get_lockless() noticing. In >>>> this second example, let's assume everything is the same execpt FolioA is >>>> initially clean: >>>> >>>> ThreadA ThreadB >>>> ======= ======= >>>> >>>> gup_pte_range() >>>> pte1 = ptep_get_lockless(PTE3) >>>> READ_ONCE(PTE3) >>>> mmap(PTE0) >>>> clear_pte(PTE0) >>>> unfold(PTE0 - PTE3) >>>> WRITE_ONCE(PTE0, 0) >>>> WRITE_ONCE(PTE1, 0) >>>> WRITE_ONCE(PTE2, 0) >>>> WRITE_ONCE(PTE3, 0) >>>> TLBI >>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>> WRITE_ONCE(PTE0, 0) >>>> set_pte_at(PTE0, <new>) >>>> write to FolioB - HW sets PTE0's dirty >>>> READ_ONCE(PTE0) (for a/d) << DIRTY!! >>>> READ_ONCE(PTE1) (for a/d) >>>> READ_ONCE(PTE2) (for a/d) >>>> READ_ONCE(PTE3) (for a/d) >>>> <do speculative work with pte1 content> >>>> pte2 = ptep_get_lockless(PTE3) >>>> READ_ONCE(PTE3) << BUT THIS IS FOR FolioB >>>> READ_ONCE(PTE0) (for a/d) >>>> READ_ONCE(PTE1) (for a/d) >>>> READ_ONCE(PTE2) (for a/d) >>>> READ_ONCE(PTE3) (for a/d) >>>> false = pte_same(pte1, pte2) << So this fails >>>> >>>> The only way I can see false-positive not being caught in the second example is >>>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario. >>>> But these lockless table walkers are already suseptible to that. >>>> >>>> I think all the same arguments can be extended to the access bit. >>>> >>>> >>>> For me this is all getting rather subtle and difficult to reason about and even >>>> harder to spec in a comprehensible way. The best I could come up with is: >>>> >>>> "All fields in the returned pte are guarranteed to be self-consistent except >>>> for >>>> access and dirty information, which may be inconsistent if a racing >>>> modification >>>> occured. Additionally it is guranteed that false-positive access and/or dirty >>>> information is not possible if 2 calls are made and both ptes are the same. >>>> Only >>>> false-negative access and/or dirty information is possible in this scenario." >>>> >>>> which is starting to sound bonkers. Personally I think we are better off at >>>> this >>>> point, just keeping today's arm64 ptep_get_lockless(). >>> >>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that >>> would sync against GUP-fast? >> >> No, the broadcast is in HW. There is no IPI. > > Okay ... > > I agree that the semantics are a bit weird, but if we could get rid of > ptep_get_lockless() on arm64, that would also be nice. > > > Something I've been thinking of ... just to share what I've had in mind. The two > types of users we currently have are: > > (1) ptep_get_lockless() followed by ptep_get() check under PTL. > > (2) ptep_get_lockless() followed by ptep_get() check without PTL. > > What if we had the following instead: > > (1) ptep_get_lockless() followed by ptep_get() check under PTL. > > (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without > PTL. > > And on arm64 let > > (1) ptep_get_lockless() be ptep_get() > > (2) ptep_get_gup_fast() be __ptep_get(). > > That would mean, that (2) would not care if another cont-pte is dirty, because > we don't collect access+dirty bits. That way, we avoid any races with concurrent > unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes() > would have to set all cont-PTEs dirty, even if any of these already is dirty. I don't think the "problematic" thing is actually a problem; set_ptes() will always set the dirty bit to the same value for all ptes it covers (and if you do set_ptes() on a partial contpte block, it will be unfolded first). Although I suspect I've misunderstood what you meant there... The potential problem I see with this is that the Arm ARM doesn't specify which PTE of a contpte block the HW stores a/d in. So the HW _could_ update them randomly and this could spuriously increase your check failure rate. In reality I believe most implementations will update the PTE for the address that caused the TLB to be populated. But in some cases, you could have eviction (due to pressure or explicit invalidation) followed by re-population due to faulting on a different page of the contpte block. In this case you would see this type of problem too. But ultimately, isn't this basically equivalent to ptep_get_lockless() returning potentially false-negatives for access and dirty? Just with a much higher chance of getting a false-negative. How is this helping?
On 15.04.24 17:17, Ryan Roberts wrote: > On 15/04/2024 15:58, David Hildenbrand wrote: >> On 15.04.24 16:34, Ryan Roberts wrote: >>> On 15/04/2024 15:23, David Hildenbrand wrote: >>>> On 15.04.24 15:30, Ryan Roberts wrote: >>>>> On 15/04/2024 11:57, David Hildenbrand wrote: >>>>>> On 15.04.24 11:28, Ryan Roberts wrote: >>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote: >>>>>>>>> >>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and >>>>>>>>> "lockless walkers that never take the PTL". >>>>>>>>> >>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is >>>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the >>>>>>>>> TLBIs). But >>>>>>>>> you make that clear further down. >>>>>>>> >>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page >>>>>>>> tables >>>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and >>>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast). >>>>>>>> [...] >>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Could it be this easy? My head is hurting... >>>>>>>>>> >>>>>>>>>> I think what has to happen is: >>>>>>>>>> >>>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as >>>>>>>>>> there >>>>>>>>>> are no races. No removal/addition of access/dirty bits etc. >>>>>>>>> >>>>>>>>> Today's arm64 ptep_get() guarantees this. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle >>>>>>>>>> serious "garbage PTEs". This is our page fault handler. >>>>>>>>> >>>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement >>>>>>>>> about a >>>>>>>>> class of users. I agree with the statement. >>>>>>>> >>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page >>>>>>>> fault handlers. Well, mostly "not GUP". >>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle >>>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options: >>>>>>>>>> >>>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if >>>>>>>>>> the >>>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB >>>>>>>>>> flushes >>>>>>>>>> required. This is the common case. HW might concurrently set access/dirty >>>>>>>>>> bits, >>>>>>>>>> so we can race with that. But we don't read garbage. >>>>>>>>> >>>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are >>>>>>>>> consistent for contpte ptes. That's the bit that complicates the current >>>>>>>>> ptep_get_lockless() implementation. >>>>>>>>> >>>>>>>>> But the point I was trying to make is that GUP-fast does not actually care >>>>>>>>> about >>>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec >>>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are >>>>>>>>> guarranteed >>>>>>>>> to be self-consistent except for access and dirty information, which may be >>>>>>>>> inconsistent if a racing modification occured". >>>>>>>> >>>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such >>>>>>>> that >>>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not >>>>>>>> there >>>>>>>> yet, but one thing I was discussing on the list recently. Burried in: >>>>>>>> >>>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com >>>>>>>> >>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will >>>>>>>> properly notify us when the PTE would lose its dirty bits. >>>>>>>> >>>>>>>> But getting false-positive dirty bits would be problematic. >>>>>>>> >>>>>>>>> >>>>>>>>> This could mean that the access/dirty state *does* change for a given page >>>>>>>>> while >>>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I >>>>>>>>> *think* >>>>>>>>> that failing to detect this is benign. >>>>>>>> >>>>>>>> I mean, HW could just set the dirty/access bit immediately after the >>>>>>>> check. So >>>>>>>> if HW concurrently sets the bit and we don't observe that change when we >>>>>>>> recheck, I think that would be perfectly fine. >>>>>>> >>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or >>>>>>> soft-dirty or uffd-wp). >>>>>>> >>>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly >>>>>>> allow >>>>>>> this (because you have the KVM use case where false-positive dirty is >>>>>>> problematic), then I think we are stuck with ptep_get_lockless() as >>>>>>> implemented >>>>>>> for arm64 today. >>>>>> >>>>>> At least regarding the dirty bit, we'd have to guarantee that if >>>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck >>>>>> would be able to catch that. >>>>>> >>>>>> Would that be possible? >>>>> >>>>> Hmm maybe. My head hurts. Let me try to work through some examples... >>>>> >>>>> >>>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1, >>>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is >>>>> stored in >>>>> PTE0 for the contpte block, and it is dirty. >>>>> >>>>> Now let's say there are 2 racing threads: >>>>> >>>>> - ThreadA is doing a GUP-fast for PTE3 >>>>> - ThreadB is remapping order-0 FolioB at PTE0 >>>>> >>>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the >>>>> example - today's arm64 ptep_get_lockless() can handle the below correctly). >>>>> >>>>> ThreadA ThreadB >>>>> ======= ======= >>>>> >>>>> gup_pte_range() >>>>> pte1 = ptep_get_lockless(PTE3) >>>>> READ_ONCE(PTE3) >>>>> mmap(PTE0) >>>>> clear_pte(PTE0) >>>>> unfold(PTE0 - PTE3) >>>>> WRITE_ONCE(PTE0, 0) >>>>> WRITE_ONCE(PTE1, 0) >>>>> WRITE_ONCE(PTE2, 0) >>>>> READ_ONCE(PTE0) (for a/d) << CLEAN!! >>>>> READ_ONCE(PTE1) (for a/d) >>>>> READ_ONCE(PTE2) (for a/d) >>>>> READ_ONCE(PTE3) (for a/d) >>>>> <do speculative work with pte1 content> >>>>> pte2 = ptep_get_lockless(PTE3) >>>>> READ_ONCE(PTE3) >>>>> READ_ONCE(PTE0) (for a/d) >>>>> READ_ONCE(PTE1) (for a/d) >>>>> READ_ONCE(PTE2) (for a/d) >>>>> READ_ONCE(PTE3) (for a/d) >>>>> true = pte_same(pte1, pte2) >>>>> WRITE_ONCE(PTE3, 0) >>>>> TLBI >>>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE0, 0) >>>>> set_pte_at(PTE0, <new>) >>>>> >>>>> This example shows how a *false-negative* can be returned for the dirty state, >>>>> which isn't detected by the check. >>>>> >>>>> I've been unable to come up with an example where a *false-positive* can be >>>>> returned for dirty state without the second ptep_get_lockless() noticing. In >>>>> this second example, let's assume everything is the same execpt FolioA is >>>>> initially clean: >>>>> >>>>> ThreadA ThreadB >>>>> ======= ======= >>>>> >>>>> gup_pte_range() >>>>> pte1 = ptep_get_lockless(PTE3) >>>>> READ_ONCE(PTE3) >>>>> mmap(PTE0) >>>>> clear_pte(PTE0) >>>>> unfold(PTE0 - PTE3) >>>>> WRITE_ONCE(PTE0, 0) >>>>> WRITE_ONCE(PTE1, 0) >>>>> WRITE_ONCE(PTE2, 0) >>>>> WRITE_ONCE(PTE3, 0) >>>>> TLBI >>>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>>> WRITE_ONCE(PTE0, 0) >>>>> set_pte_at(PTE0, <new>) >>>>> write to FolioB - HW sets PTE0's dirty >>>>> READ_ONCE(PTE0) (for a/d) << DIRTY!! >>>>> READ_ONCE(PTE1) (for a/d) >>>>> READ_ONCE(PTE2) (for a/d) >>>>> READ_ONCE(PTE3) (for a/d) >>>>> <do speculative work with pte1 content> >>>>> pte2 = ptep_get_lockless(PTE3) >>>>> READ_ONCE(PTE3) << BUT THIS IS FOR FolioB >>>>> READ_ONCE(PTE0) (for a/d) >>>>> READ_ONCE(PTE1) (for a/d) >>>>> READ_ONCE(PTE2) (for a/d) >>>>> READ_ONCE(PTE3) (for a/d) >>>>> false = pte_same(pte1, pte2) << So this fails >>>>> >>>>> The only way I can see false-positive not being caught in the second example is >>>>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario. >>>>> But these lockless table walkers are already suseptible to that. >>>>> >>>>> I think all the same arguments can be extended to the access bit. >>>>> >>>>> >>>>> For me this is all getting rather subtle and difficult to reason about and even >>>>> harder to spec in a comprehensible way. The best I could come up with is: >>>>> >>>>> "All fields in the returned pte are guarranteed to be self-consistent except >>>>> for >>>>> access and dirty information, which may be inconsistent if a racing >>>>> modification >>>>> occured. Additionally it is guranteed that false-positive access and/or dirty >>>>> information is not possible if 2 calls are made and both ptes are the same. >>>>> Only >>>>> false-negative access and/or dirty information is possible in this scenario." >>>>> >>>>> which is starting to sound bonkers. Personally I think we are better off at >>>>> this >>>>> point, just keeping today's arm64 ptep_get_lockless(). >>>> >>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that >>>> would sync against GUP-fast? >>> >>> No, the broadcast is in HW. There is no IPI. >> >> Okay ... >> >> I agree that the semantics are a bit weird, but if we could get rid of >> ptep_get_lockless() on arm64, that would also be nice. >> >> >> Something I've been thinking of ... just to share what I've had in mind. The two >> types of users we currently have are: >> >> (1) ptep_get_lockless() followed by ptep_get() check under PTL. >> >> (2) ptep_get_lockless() followed by ptep_get() check without PTL. >> >> What if we had the following instead: >> >> (1) ptep_get_lockless() followed by ptep_get() check under PTL. >> >> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without >> PTL. >> >> And on arm64 let >> >> (1) ptep_get_lockless() be ptep_get() >> >> (2) ptep_get_gup_fast() be __ptep_get(). >> >> That would mean, that (2) would not care if another cont-pte is dirty, because >> we don't collect access+dirty bits. That way, we avoid any races with concurrent >> unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes() >> would have to set all cont-PTEs dirty, even if any of these already is dirty. > > I don't think the "problematic" thing is actually a problem; set_ptes() will > always set the dirty bit to the same value for all ptes it covers (and if you do > set_ptes() on a partial contpte block, it will be unfolded first). Although I > suspect I've misunderstood what you meant there... It's more code like that following that I am concerned about. if (pte_dirty()) { /* Great, nothing to do */ } else mte_mkdirty(); set_ptes(); ... } > > The potential problem I see with this is that the Arm ARM doesn't specify which > PTE of a contpte block the HW stores a/d in. So the HW _could_ update them > randomly and this could spuriously increase your check failure rate. In reality > I believe most implementations will update the PTE for the address that caused > the TLB to be populated. But in some cases, you could have eviction (due to > pressure or explicit invalidation) followed by re-population due to faulting on > a different page of the contpte block. In this case you would see this type of > problem too. > > But ultimately, isn't this basically equivalent to ptep_get_lockless() returning > potentially false-negatives for access and dirty? Just with a much higher chance > of getting a false-negative. How is this helping? You are performing an atomic read like GUP-fast wants you to. So there are no races to worry about like on other architectures: HW might *set* the dirty bit concurrently, but that's just fine. The whole races you describe with concurrent folding/unfolding/ ... are irrelevant. To me that sounds ... much simpler ;) But again, just something I've been thinking about. The reuse of pte_get_lockless() outside GUP code might not have been the wisest choice.
On 15/04/2024 16:22, David Hildenbrand wrote: > On 15.04.24 17:17, Ryan Roberts wrote: >> On 15/04/2024 15:58, David Hildenbrand wrote: >>> On 15.04.24 16:34, Ryan Roberts wrote: >>>> On 15/04/2024 15:23, David Hildenbrand wrote: >>>>> On 15.04.24 15:30, Ryan Roberts wrote: >>>>>> On 15/04/2024 11:57, David Hildenbrand wrote: >>>>>>> On 15.04.24 11:28, Ryan Roberts wrote: >>>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote: >>>>>>>>>> >>>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and >>>>>>>>>> "lockless walkers that never take the PTL". >>>>>>>>>> >>>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is >>>>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the >>>>>>>>>> TLBIs). But >>>>>>>>>> you make that clear further down. >>>>>>>>> >>>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page >>>>>>>>> tables >>>>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific >>>>>>>>> and >>>>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast). >>>>>>>>> [...] >>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Could it be this easy? My head is hurting... >>>>>>>>>>> >>>>>>>>>>> I think what has to happen is: >>>>>>>>>>> >>>>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as >>>>>>>>>>> long as >>>>>>>>>>> there >>>>>>>>>>> are no races. No removal/addition of access/dirty bits etc. >>>>>>>>>> >>>>>>>>>> Today's arm64 ptep_get() guarantees this. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can >>>>>>>>>>> handle >>>>>>>>>>> serious "garbage PTEs". This is our page fault handler. >>>>>>>>>> >>>>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement >>>>>>>>>> about a >>>>>>>>>> class of users. I agree with the statement. >>>>>>>>> >>>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as >>>>>>>>> page >>>>>>>>> fault handlers. Well, mostly "not GUP". >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot >>>>>>>>>>> handle >>>>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options: >>>>>>>>>>> >>>>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check >>>>>>>>>>> later if >>>>>>>>>>> the >>>>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB >>>>>>>>>>> flushes >>>>>>>>>>> required. This is the common case. HW might concurrently set >>>>>>>>>>> access/dirty >>>>>>>>>>> bits, >>>>>>>>>>> so we can race with that. But we don't read garbage. >>>>>>>>>> >>>>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are >>>>>>>>>> consistent for contpte ptes. That's the bit that complicates the current >>>>>>>>>> ptep_get_lockless() implementation. >>>>>>>>>> >>>>>>>>>> But the point I was trying to make is that GUP-fast does not actually >>>>>>>>>> care >>>>>>>>>> about >>>>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec >>>>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are >>>>>>>>>> guarranteed >>>>>>>>>> to be self-consistent except for access and dirty information, which >>>>>>>>>> may be >>>>>>>>>> inconsistent if a racing modification occured". >>>>>>>>> >>>>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such >>>>>>>>> that >>>>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not >>>>>>>>> there >>>>>>>>> yet, but one thing I was discussing on the list recently. Burried in: >>>>>>>>> >>>>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com >>>>>>>>> >>>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will >>>>>>>>> properly notify us when the PTE would lose its dirty bits. >>>>>>>>> >>>>>>>>> But getting false-positive dirty bits would be problematic. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> This could mean that the access/dirty state *does* change for a given >>>>>>>>>> page >>>>>>>>>> while >>>>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I >>>>>>>>>> *think* >>>>>>>>>> that failing to detect this is benign. >>>>>>>>> >>>>>>>>> I mean, HW could just set the dirty/access bit immediately after the >>>>>>>>> check. So >>>>>>>>> if HW concurrently sets the bit and we don't observe that change when we >>>>>>>>> recheck, I think that would be perfectly fine. >>>>>>>> >>>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or >>>>>>>> soft-dirty or uffd-wp). >>>>>>>> >>>>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly >>>>>>>> allow >>>>>>>> this (because you have the KVM use case where false-positive dirty is >>>>>>>> problematic), then I think we are stuck with ptep_get_lockless() as >>>>>>>> implemented >>>>>>>> for arm64 today. >>>>>>> >>>>>>> At least regarding the dirty bit, we'd have to guarantee that if >>>>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck >>>>>>> would be able to catch that. >>>>>>> >>>>>>> Would that be possible? >>>>>> >>>>>> Hmm maybe. My head hurts. Let me try to work through some examples... >>>>>> >>>>>> >>>>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs >>>>>> 0, 1, >>>>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is >>>>>> stored in >>>>>> PTE0 for the contpte block, and it is dirty. >>>>>> >>>>>> Now let's say there are 2 racing threads: >>>>>> >>>>>> - ThreadA is doing a GUP-fast for PTE3 >>>>>> - ThreadB is remapping order-0 FolioB at PTE0 >>>>>> >>>>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the >>>>>> example - today's arm64 ptep_get_lockless() can handle the below correctly). >>>>>> >>>>>> ThreadA ThreadB >>>>>> ======= ======= >>>>>> >>>>>> gup_pte_range() >>>>>> pte1 = ptep_get_lockless(PTE3) >>>>>> READ_ONCE(PTE3) >>>>>> mmap(PTE0) >>>>>> clear_pte(PTE0) >>>>>> unfold(PTE0 - PTE3) >>>>>> WRITE_ONCE(PTE0, 0) >>>>>> WRITE_ONCE(PTE1, 0) >>>>>> WRITE_ONCE(PTE2, 0) >>>>>> READ_ONCE(PTE0) (for a/d) << CLEAN!! >>>>>> READ_ONCE(PTE1) (for a/d) >>>>>> READ_ONCE(PTE2) (for a/d) >>>>>> READ_ONCE(PTE3) (for a/d) >>>>>> <do speculative work with pte1 content> >>>>>> pte2 = ptep_get_lockless(PTE3) >>>>>> READ_ONCE(PTE3) >>>>>> READ_ONCE(PTE0) (for a/d) >>>>>> READ_ONCE(PTE1) (for a/d) >>>>>> READ_ONCE(PTE2) (for a/d) >>>>>> READ_ONCE(PTE3) (for a/d) >>>>>> true = pte_same(pte1, pte2) >>>>>> WRITE_ONCE(PTE3, 0) >>>>>> TLBI >>>>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE0, 0) >>>>>> set_pte_at(PTE0, <new>) >>>>>> >>>>>> This example shows how a *false-negative* can be returned for the dirty >>>>>> state, >>>>>> which isn't detected by the check. >>>>>> >>>>>> I've been unable to come up with an example where a *false-positive* can be >>>>>> returned for dirty state without the second ptep_get_lockless() noticing. In >>>>>> this second example, let's assume everything is the same execpt FolioA is >>>>>> initially clean: >>>>>> >>>>>> ThreadA ThreadB >>>>>> ======= ======= >>>>>> >>>>>> gup_pte_range() >>>>>> pte1 = ptep_get_lockless(PTE3) >>>>>> READ_ONCE(PTE3) >>>>>> mmap(PTE0) >>>>>> clear_pte(PTE0) >>>>>> unfold(PTE0 - PTE3) >>>>>> WRITE_ONCE(PTE0, 0) >>>>>> WRITE_ONCE(PTE1, 0) >>>>>> WRITE_ONCE(PTE2, 0) >>>>>> WRITE_ONCE(PTE3, 0) >>>>>> TLBI >>>>>> WRITE_ONCE(PTE0, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE1, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE2, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE3, <orig & ~CONT>) >>>>>> WRITE_ONCE(PTE0, 0) >>>>>> set_pte_at(PTE0, <new>) >>>>>> write to FolioB - HW sets PTE0's dirty >>>>>> READ_ONCE(PTE0) (for a/d) << DIRTY!! >>>>>> READ_ONCE(PTE1) (for a/d) >>>>>> READ_ONCE(PTE2) (for a/d) >>>>>> READ_ONCE(PTE3) (for a/d) >>>>>> <do speculative work with pte1 content> >>>>>> pte2 = ptep_get_lockless(PTE3) >>>>>> READ_ONCE(PTE3) << BUT THIS IS FOR FolioB >>>>>> READ_ONCE(PTE0) (for a/d) >>>>>> READ_ONCE(PTE1) (for a/d) >>>>>> READ_ONCE(PTE2) (for a/d) >>>>>> READ_ONCE(PTE3) (for a/d) >>>>>> false = pte_same(pte1, pte2) << So this fails >>>>>> >>>>>> The only way I can see false-positive not being caught in the second >>>>>> example is >>>>>> if ThreadB subseuently remaps the original folio, so you have an ABA >>>>>> scenario. >>>>>> But these lockless table walkers are already suseptible to that. >>>>>> >>>>>> I think all the same arguments can be extended to the access bit. >>>>>> >>>>>> >>>>>> For me this is all getting rather subtle and difficult to reason about and >>>>>> even >>>>>> harder to spec in a comprehensible way. The best I could come up with is: >>>>>> >>>>>> "All fields in the returned pte are guarranteed to be self-consistent except >>>>>> for >>>>>> access and dirty information, which may be inconsistent if a racing >>>>>> modification >>>>>> occured. Additionally it is guranteed that false-positive access and/or dirty >>>>>> information is not possible if 2 calls are made and both ptes are the same. >>>>>> Only >>>>>> false-negative access and/or dirty information is possible in this scenario." >>>>>> >>>>>> which is starting to sound bonkers. Personally I think we are better off at >>>>>> this >>>>>> point, just keeping today's arm64 ptep_get_lockless(). >>>>> >>>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that >>>>> would sync against GUP-fast? >>>> >>>> No, the broadcast is in HW. There is no IPI. >>> >>> Okay ... >>> >>> I agree that the semantics are a bit weird, but if we could get rid of >>> ptep_get_lockless() on arm64, that would also be nice. >>> >>> >>> Something I've been thinking of ... just to share what I've had in mind. The two >>> types of users we currently have are: >>> >>> (1) ptep_get_lockless() followed by ptep_get() check under PTL. >>> >>> (2) ptep_get_lockless() followed by ptep_get() check without PTL. >>> >>> What if we had the following instead: >>> >>> (1) ptep_get_lockless() followed by ptep_get() check under PTL. >>> >>> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without >>> PTL. >>> >>> And on arm64 let >>> >>> (1) ptep_get_lockless() be ptep_get() >>> >>> (2) ptep_get_gup_fast() be __ptep_get(). >>> >>> That would mean, that (2) would not care if another cont-pte is dirty, because >>> we don't collect access+dirty bits. That way, we avoid any races with concurrent >>> unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes() >>> would have to set all cont-PTEs dirty, even if any of these already is dirty. >> >> I don't think the "problematic" thing is actually a problem; set_ptes() will >> always set the dirty bit to the same value for all ptes it covers (and if you do >> set_ptes() on a partial contpte block, it will be unfolded first). Although I >> suspect I've misunderstood what you meant there... > > It's more code like that following that I am concerned about. > > if (pte_dirty()) { > /* Great, nothing to do */ > } else > mte_mkdirty(); > set_ptes(); > ... > } OK I see, so you're worried about uneccessary unfolding that the false-negative dirty reporting could cause? I think the best solution there would be for the core to use the clear_young_dirty_ptes(CYDP_CLEAR_DIRTY) API that Lance adds in his series at [1]. That would avoid any unfolding and just dirty all contpte block(s) touched by the range. [1] https://lore.kernel.org/linux-mm/20240413002219.71246-1-ioworker0@gmail.com/ > >> >> The potential problem I see with this is that the Arm ARM doesn't specify which >> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them >> randomly and this could spuriously increase your check failure rate. In reality >> I believe most implementations will update the PTE for the address that caused >> the TLB to be populated. But in some cases, you could have eviction (due to >> pressure or explicit invalidation) followed by re-population due to faulting on >> a different page of the contpte block. In this case you would see this type of >> problem too. >> >> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning >> potentially false-negatives for access and dirty? Just with a much higher chance >> of getting a false-negative. How is this helping? > > You are performing an atomic read like GUP-fast wants you to. So there are no > races to worry about like on other architectures: HW might *set* the dirty bit > concurrently, but that's just fine. But you can still see false-negatives for access and dirty... > > The whole races you describe with concurrent folding/unfolding/ ... are irrelevant. And I think I convinced myself that you will only see false-negatives with today's arm64 ptep_get(). But an order or magnitude fewer than with your proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of those). > > To me that sounds ... much simpler ;) But again, just something I've been > thinking about. OK so this approach upgrades my "I'm fairly sure we never see false-positives" to "we definitely never see false-positives". But it certainly increases the quantity of false-negatives. > > The reuse of pte_get_lockless() outside GUP code might not have been the wisest > choice. > If you want to go down the ptep_get_gup_fast() route, you've still got to be able to spec it, and I think it will land pretty close to my most recent stab at respec'ing ptep_get_lockless() a couple of replies up on this thread. Where would your proposal leave the KVM use case? If you call it ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would be left with ptep_get()... Sorry this thread is getting so long. Just to summarise, I think there are currently 3 solutions on the table: - ptep_get_lockless() remains as is - ptep_get_lockless() wraps ptep_get() - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename) Based on discussion so far, that's also the order of my preference. Perhaps its useful to enumerate why we dislike the current ptep_get_lockless()? I think its just the potential for looping in the face of concurrent modifications?
>>> The potential problem I see with this is that the Arm ARM doesn't specify which >>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them >>> randomly and this could spuriously increase your check failure rate. In reality >>> I believe most implementations will update the PTE for the address that caused >>> the TLB to be populated. But in some cases, you could have eviction (due to >>> pressure or explicit invalidation) followed by re-population due to faulting on >>> a different page of the contpte block. In this case you would see this type of >>> problem too. >>> >>> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning >>> potentially false-negatives for access and dirty? Just with a much higher chance >>> of getting a false-negative. How is this helping? >> >> You are performing an atomic read like GUP-fast wants you to. So there are no >> races to worry about like on other architectures: HW might *set* the dirty bit >> concurrently, but that's just fine. > > But you can still see false-negatives for access and dirty... Yes. > >> >> The whole races you describe with concurrent folding/unfolding/ ... are irrelevant. > > And I think I convinced myself that you will only see false-negatives with > today's arm64 ptep_get(). But an order or magnitude fewer than with your > proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of those). > >> >> To me that sounds ... much simpler ;) But again, just something I've been >> thinking about. > > OK so this approach upgrades my "I'm fairly sure we never see false-positives" > to "we definitely never see false-positives". But it certainly increases the > quantity of false-negatives. Yes. > >> >> The reuse of pte_get_lockless() outside GUP code might not have been the wisest >> choice. >> > > If you want to go down the ptep_get_gup_fast() route, you've still got to be > able to spec it, and I think it will land pretty close to my most recent stab at > respec'ing ptep_get_lockless() a couple of replies up on this thread. > > Where would your proposal leave the KVM use case? If you call it > ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would > be left with ptep_get()... It's using GUP-fast. > > Sorry this thread is getting so long. Just to summarise, I think there are > currently 3 solutions on the table: > > - ptep_get_lockless() remains as is > - ptep_get_lockless() wraps ptep_get() > - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename) > > Based on discussion so far, that's also the order of my preference. (1) seems like the easiest thing to do. > > Perhaps its useful to enumerate why we dislike the current ptep_get_lockless()? Well, you sent that patch series with "that aims to reduce the cost and complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve that. :)
Hi David, Sorry for the slow reply on this; its was due to a combination of thinking a bit more about the options here and being out on holiday. On 15/04/2024 17:02, David Hildenbrand wrote: >>>> The potential problem I see with this is that the Arm ARM doesn't specify which >>>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them >>>> randomly and this could spuriously increase your check failure rate. In reality >>>> I believe most implementations will update the PTE for the address that caused >>>> the TLB to be populated. But in some cases, you could have eviction (due to >>>> pressure or explicit invalidation) followed by re-population due to faulting on >>>> a different page of the contpte block. In this case you would see this type of >>>> problem too. >>>> >>>> But ultimately, isn't this basically equivalent to ptep_get_lockless() >>>> returning >>>> potentially false-negatives for access and dirty? Just with a much higher >>>> chance >>>> of getting a false-negative. How is this helping? >>> >>> You are performing an atomic read like GUP-fast wants you to. So there are no >>> races to worry about like on other architectures: HW might *set* the dirty bit >>> concurrently, but that's just fine. >> >> But you can still see false-negatives for access and dirty... > > Yes. > >> >>> >>> The whole races you describe with concurrent folding/unfolding/ ... are >>> irrelevant. >> >> And I think I convinced myself that you will only see false-negatives with >> today's arm64 ptep_get(). But an order or magnitude fewer than with your >> proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of >> those). >> >>> >>> To me that sounds ... much simpler ;) But again, just something I've been >>> thinking about. >> >> OK so this approach upgrades my "I'm fairly sure we never see false-positives" >> to "we definitely never see false-positives". But it certainly increases the >> quantity of false-negatives. > > Yes. > >> >>> >>> The reuse of pte_get_lockless() outside GUP code might not have been the wisest >>> choice. >>> >> >> If you want to go down the ptep_get_gup_fast() route, you've still got to be >> able to spec it, and I think it will land pretty close to my most recent stab at >> respec'ing ptep_get_lockless() a couple of replies up on this thread. >> >> Where would your proposal leave the KVM use case? If you call it >> ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would >> be left with ptep_get()... > > It's using GUP-fast. > >> >> Sorry this thread is getting so long. Just to summarise, I think there are >> currently 3 solutions on the table: >> >> - ptep_get_lockless() remains as is >> - ptep_get_lockless() wraps ptep_get() >> - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename) >> >> Based on discussion so far, that's also the order of my preference. > > (1) seems like the easiest thing to do. Yes, I'm very much in favour of easy. > >> >> Perhaps its useful to enumerate why we dislike the current ptep_get_lockless()? > > Well, you sent that patch series with "that aims to reduce the cost and > complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve that. :) Touche! I'd half forgotten that we were having this conversation in the context of this series! I guess your ptep_get_gup_fast() approach is very similar to ptep_get_lockless_norecency()... So we are back to the beginning :) But ultimately I've come to the conclusion that it is easy to reason about the current arm64 ptep_get_lockless() implementation and see that its correct. The other options both have their drawbacks. Yes, there is a loop in the current implementation that would be nice to get rid of, but I don't think it is really any worse than the cmpxchg loops we already have in other helpers. I'm not planning to persue this any further. Thanks for the useful discussion (as always).
On 23.04.24 12:15, Ryan Roberts wrote: > Hi David, > > Sorry for the slow reply on this; its was due to a combination of thinking a bit > more about the options here and being out on holiday. > No worries, there are things more important in life than ptep_get_lockless() :D >> (1) seems like the easiest thing to do. > > Yes, I'm very much in favour of easy. > >> >>> >>> Perhaps its useful to enumerate why we dislike the current ptep_get_lockless()? >> >> Well, you sent that patch series with "that aims to reduce the cost and >> complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve that. :) > > Touche! I'd half forgotten that we were having this conversation in the context > of this series! > > I guess your ptep_get_gup_fast() approach is very similar to > ptep_get_lockless_norecency()... So we are back to the beginning :) Except that it would be limited to GUP-fast :) > > But ultimately I've come to the conclusion that it is easy to reason about the > current arm64 ptep_get_lockless() implementation and see that its correct. The > other options both have their drawbacks. Yes. > > Yes, there is a loop in the current implementation that would be nice to get rid > of, but I don't think it is really any worse than the cmpxchg loops we already > have in other helpers. > > I'm not planning to persue this any further. Thanks for the useful discussion > (as always). Make sense to me. let's leave it as is for the time being. (and also see if a GUP-fast user that needs precise dirty/accessed actually gets real)