Message ID | 20180822154046.772017055@infradead.org
---|---
State | New, archived
Series | x86: TLB invalidate fixes
On Wed, 22 Aug 2018 17:30:14 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> Will noted that only checking mm_users is incorrect; we should also
> check mm_count in order to cover CPUs that have a lazy reference to
> this mm (and could do speculative TLB operations).

Why is that incorrect?

This shortcut has nothing to do with no TLBs -- not sure about x86, but
other CPUs can certainly have remaining TLBs here, speculative
operations or not (even if they don't have an mm_count ref they can
have TLBs here).

So that leaves speculative operations. I don't see where the problem is
with those either -- this shortcut needs to ensure there are no other
*non speculative* operations. mm_users is correct for that.

If there is a speculation security problem here it should be carefully
documented, otherwise it's going to be re-introduced...

I actually have a patch to extend this optimisation further that I'm
going to send out again today. It's nice to avoid the double handling
of the pages.

Thanks,
Nick

>
> If removing this turns out to be a performance issue, we can
> re-instate a more complete check, but in tlb_table_flush() eliding the
> call_rcu_sched().
>
> Cc: stable@kernel.org
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: David Miller <davem@davemloft.net>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Fixes: 267239116987 ("mm, powerpc: move the RCU page-table freeing into generic code")
> Reported-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  mm/memory.c | 9 ---------
>  1 file changed, 9 deletions(-)
>
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -375,15 +375,6 @@ void tlb_remove_table(struct mmu_gather
>  {
>  	struct mmu_table_batch **batch = &tlb->batch;
>
> -	/*
> -	 * When there's less then two users of this mm there cannot be a
> -	 * concurrent page-table walk.
> -	 */
> -	if (atomic_read(&tlb->mm->mm_users) < 2) {
> -		__tlb_remove_table(table);
> -		return;
> -	}
> -
>  	if (*batch == NULL) {
>  		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
>  		if (*batch == NULL) {
On Wed, Aug 22, 2018 at 8:31 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> So that leaves speculative operations. I don't see where the problem is
> with those either -- this shortcut needs to ensure there are no other
> *non speculative* operations. mm_users is correct for that.

No. Because mm_users doesn't contain any lazy tlb users.

And yes, those lazy tlbs are all kernel threads, but they can still
speculatively load user addresses.

              Linus
On Wed, Aug 22, 2018 at 8:35 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> No. Because mm_users doesn't contain any lazy tlb users.

.. or, as it turns out, the use_mm() case either, which can do gup_fast().

Oh well.

              Linus
On Wed, 22 Aug 2018 20:35:16 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Aug 22, 2018 at 8:31 PM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > So that leaves speculative operations. I don't see where the problem is
> > with those either -- this shortcut needs to ensure there are no other
> > *non speculative* operations. mm_users is correct for that.
>
> No. Because mm_users doesn't contain any lazy tlb users.
>
> And yes, those lazy tlbs are all kernel threads, but they can still
> speculatively load user addresses.

So?

If the arch does not shoot those all down after the user page tables
are removed then it's buggy regardless of this short cut.

The only real problem I could see would be if a page walk cache still
points to the freed table, then the table gets re-allocated and used
elsewhere, and meanwhile a speculative access tries to load an entry
from the page that is an invalid form of page table that might cause
a machine check or something. That would be (u)arch specific, but if
that's what we're concerned with here it's a different issue and needs
to be documented as such.

I'll have a look at powerpc and see if we can cope with it. If so,
I'll make it an arch specific opt-in short cut.

Thanks,
Nick
On Wed, Aug 22, 2018 at 9:16 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> On Wed, 22 Aug 2018 20:35:16 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > And yes, those lazy tlbs are all kernel threads, but they can still
> > speculatively load user addresses.
>
> So?
>
> If the arch does not shoot those all down after the user page tables
> are removed then it's buggy regardless of this short cut.

Read the code.

That shortcut frees the pages *WITHOUT* doing that TLB flush. It just
does __tlb_remove_table(), which does *not* do that whole page
queueing so that we can batch flush and then free later.

So the fast case is buggy, exactly because of the reason you state.

The logic used to be that if you were the only cpu that had that tlb,
and it was a mid-level table, it didn't need that synchronization at
all.

And that logic is simply wrong. Exactly because even if mm_users is 1,
there can be things looking up TLB entries on other CPU's. Either
because of a lazy mm and a hw walker with speculation, or because of
use_mm() and a software walker.

So the whole "free immediately" shortcut was bogus. You *always* need
to queue, then flush the tlb, and then free.

That said, that wasn't actually the bug that Jann found. He found the
bug a few lines lower down, where the "I can't allocate memory for
queueing" case ended up *also* not flushing the TLB.

> The only real problem I could see would be if a page walk cache still
> points to the freed table, then the table gets re-allocated and used
> elsewhere, and meanwhile a speculative access tries to load an entry
> from the page that is an invalid form of page table that might cause
> a machine check or something. That would be (u)arch specific, but if
> that's what we're concerned with here it's a different issue and needs
> to be documented as such.

We've *seen* that case, we had exactly that when we were being
aggressive about trying to avoid flushes for the lazy mmu case
entirely, because "we can just flush when we activate the lazy mm
state instead".

The logic was actually solid from a TLB case - who cares what garbage
TLB entries we might speculatively fill, when we're going to flush
them before they can possibly be used.

It turns out that that logic is solid, but hits another problem: at
least some AMD CPU's will cause machine checks when the TLB entries
are inconsistent with the machine setup. That can't happen with a
*good* page table, but when you speculatively walk already freed
tables, you can get any garbage.

I forget what the exact trigger was, but this was actually reported.
So you can't free page directory pages without flushing the tlb first
(to make sure that internal tlb node caches are flushed).

So the logic for freeing leaf pages and freeing middle nodes has to be
exactly the same: you make the modification to the page table to
remove the node/leaf, you queue up the memory for freeing, you flush
the tlb, and you then free the page.

That's the ordering that tlb_remove_page() has always honored, but
that tlb_remove_table() didn't.

It honored it for the *normal* case, which is why it took so long to
notice that the TLB shootdown had been broken on x86 when it moved to
the "generic" code. The *normal* case does this all right, and batches
things up, and then when the batch fills up it does a
tlb_table_flush() which does the TLB flush and schedules the actual
freeing.

But there were two cases that *didn't* do that. The special "I'm the
only thread" fast case, and the "oops I ran out of memory, so now I'll
fake it, and just synchronize with page table walkers manually, and
then do that special direct remove without flushing the tlb".

NOTE! Jann triggered that one by

 (a) forcing the machine low on memory

 (b) force-poisoning the page tables immediately after free

I suspect it's really hard to trigger under normal loads, exactly
because the *normal* case gets it right. It's literally the "oops, I
can't batch up because I ran out of memory" case that gets it wrong.

(And the special single-thread + lazy or use_mm() case, but that's
going to be entirely impossible to trigger, because in practice it's
just a single thread, and you won't ever hit the magic timing needed
that frees the page in the single thread at exactly the same time that
some optimistic lazy mm on another cpu happens to speculatively load
that address).

So the "atomic_read(&mm_users)" case is likely entirely impossible to
trigger any sane way. But because Jann found the problem with the "ran
out of memory" case, we started looking at the more theoretical cases
that matched the same kind of "no tlb flush before free" pattern.

              Linus
On Wed, 22 Aug 2018 21:54:48 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Aug 22, 2018 at 9:16 PM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > On Wed, 22 Aug 2018 20:35:16 -0700
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > > And yes, those lazy tlbs are all kernel threads, but they can still
> > > speculatively load user addresses.
> >
> > So?
> >
> > If the arch does not shoot those all down after the user page tables
> > are removed then it's buggy regardless of this short cut.
>
> Read the code.
>
> That shortcut frees the pages *WITHOUT* doing that TLB flush.

I've read the code and I understand that. Hence I'm asking, because I
didn't think the changelog matched the change. But possibly it was
because I actually didn't read the changelog enough -- I guess it does
say the TLB operation is the problem. I may have got side tracked by
the word speculative.

> It just
> does __tlb_remove_table(), which does *not* do that whole page
> queueing so that we can batch flush and then free later.
>
> So the fast case is buggy, exactly because of the reason you state.

Okay sure, thanks for confirming it for me. I would ask for the
changelog to be slightly expanded, but maybe it's just my reading
comprehension that needs improvement...

> > The only real problem I could see would be if a page walk cache still
> > points to the freed table, then the table gets re-allocated and used
> > elsewhere, and meanwhile a speculative access tries to load an entry
> > from the page that is an invalid form of page table that might cause
> > a machine check or something. That would be (u)arch specific, but if
> > that's what we're concerned with here it's a different issue and needs
> > to be documented as such.
>
> We've *seen* that case, we had exactly that when we were being
> aggressive about trying to avoid flushes for the lazy mmu case
> entirely, because "we can just flush when we activate the lazy mm
> state instead".
>
> The logic was actually solid from a TLB case - who cares what garbage
> TLB entries we might speculatively fill, when we're going to flush
> them before they can possibly be used.
>
> It turns out that that logic is solid, but hits another problem: at
> least some AMD CPU's will cause machine checks when the TLB entries
> are inconsistent with the machine setup. That can't happen with a
> *good* page table, but when you speculatively walk already freed
> tables, you can get any garbage.

Yeah that does make sense.

> I forget what the exact trigger was, but this was actually reported.
> So you can't free page directory pages without flushing the tlb first
> (to make sure that internal tlb node caches are flushed).
>
> So the logic for freeing leaf pages and freeing middle nodes has to be
> exactly the same: you make the modification to the page table to
> remove the node/leaf, you queue up the memory for freeing, you flush
> the tlb, and you then free the page.
>
> That's the ordering that tlb_remove_page() has always honored, but
> that tlb_remove_table() didn't.
>
> It honored it for the *normal* case, which is why it took so long to
> notice that the TLB shootdown had been broken on x86 when it moved to
> the "generic" code. The *normal* case does this all right, and batches
> things up, and then when the batch fills up it does a
> tlb_table_flush() which does the TLB flush and schedules the actual
> freeing.
>
> But there were two cases that *didn't* do that. The special "I'm the
> only thread" fast case, and the "oops I ran out of memory, so now I'll
> fake it, and just synchronize with page table walkers manually, and
> then do that special direct remove without flushing the tlb".
>
> NOTE! Jann triggered that one by
>
>  (a) forcing the machine low on memory
>
>  (b) force-poisoning the page tables immediately after free
>
> I suspect it's really hard to trigger under normal loads, exactly
> because the *normal* case gets it right. It's literally the "oops, I
> can't batch up because I ran out of memory" case that gets it wrong.
>
> (And the special single-thread + lazy or use_mm() case, but that's
> going to be entirely impossible to trigger, because in practice it's
> just a single thread, and you won't ever hit the magic timing needed
> that frees the page in the single thread at exactly the same time that
> some optimistic lazy mm on another cpu happens to speculatively load
> that address).
>
> So the "atomic_read(&mm_users)" case is likely entirely impossible to
> trigger any sane way. But because Jann found the problem with the "ran
> out of memory" case, we started looking at the more theoretical cases
> that matched the same kind of "no tlb flush before free" pattern.

Thanks for giving that background. In that case I'm happy with this
fix.

Thanks,
Nick
On Wed, Aug 22, 2018 at 05:30:14PM +0200, Peter Zijlstra wrote:
> Will noted that only checking mm_users is incorrect; we should also
> check mm_count in order to cover CPUs that have a lazy reference to
> this mm (and could do speculative TLB operations).
>
> If removing this turns out to be a performance issue, we can
> re-instate a more complete check, but in tlb_table_flush() eliding the
> call_rcu_sched().
>
> Cc: stable@kernel.org
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: David Miller <davem@davemloft.net>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Fixes: 267239116987 ("mm, powerpc: move the RCU page-table freeing into generic code")
> Reported-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  mm/memory.c | 9 ---------
>  1 file changed, 9 deletions(-)

Acked-by: Will Deacon <will.deacon@arm.com>

Cheers,

Will
On Wed, Aug 22, 2018 at 09:54:48PM -0700, Linus Torvalds wrote:
> It honored it for the *normal* case, which is why it took so long to
> notice that the TLB shootdown had been broken on x86 when it moved to
> the "generic" code. The *normal* case does this all right, and batches
> things up, and then when the batch fills up it does a
> tlb_table_flush() which does the TLB flush and schedules the actual
> freeing.
>
> But there were two cases that *didn't* do that. The special "I'm the
> only thread" fast case, and the "oops I ran out of memory, so now I'll
> fake it, and just synchronize with page table walkers manually, and
> then do that special direct remove without flushing the tlb".

The actual RCU batching case was also busted; there was no guarantee
that by the time we run the RCU callbacks the invalidate would've
happened. Exceedingly unlikely, but no guarantee.

So really, all 3 cases in tlb_remove_table() were busted in this
respect.
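[Peter's point is the same ordering requirement, only with the free deferred: the invalidate must be issued before the pages are handed off for deferred freeing, otherwise the callback could run first. A minimal sketch, again with invented names and a plain function standing in for call_rcu_sched(); this is not the actual patch from this series.]

/*
 * Toy sketch of the deferred (RCU-style) path.  model_defer_free()
 * stands in for call_rcu_sched(); pretend its callback may run at any
 * point after it returns.
 */
#include <stdio.h>
#include <stdlib.h>

static void model_flush_tlb(void)
{
	printf("invalidate TLB and page-walk caches\n");
}

static void model_defer_free(void *page)
{
	printf("hand %p to deferred freeing\n", page);
	free(page);	/* in the model, "deferred" happens immediately */
}

static void table_flush_deferred(void **pages, int nr)
{
	int i;

	/*
	 * Wrong order would be: queue the deferred frees, then
	 * invalidate -- the frees may run before the invalidate.
	 * Right order, used here: invalidate first, then hand off.
	 */
	model_flush_tlb();
	for (i = 0; i < nr; i++)
		model_defer_free(pages[i]);
}

int main(void)
{
	void *pages[2] = { malloc(4096), malloc(4096) };

	table_flush_deferred(pages, 2);
	return 0;
}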
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -375,15 +375,6 @@ void tlb_remove_table(struct mmu_gather
 {
 	struct mmu_table_batch **batch = &tlb->batch;
 
-	/*
-	 * When there's less then two users of this mm there cannot be a
-	 * concurrent page-table walk.
-	 */
-	if (atomic_read(&tlb->mm->mm_users) < 2) {
-		__tlb_remove_table(table);
-		return;
-	}
-
 	if (*batch == NULL) {
 		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
 		if (*batch == NULL) {
Will noted that only checking mm_users is incorrect; we should also
check mm_count in order to cover CPUs that have a lazy reference to
this mm (and could do speculative TLB operations).

If removing this turns out to be a performance issue, we can
re-instate a more complete check, but in tlb_table_flush() eliding the
call_rcu_sched().

Cc: stable@kernel.org
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Fixes: 267239116987 ("mm, powerpc: move the RCU page-table freeing into generic code")
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/memory.c | 9 ---------
 1 file changed, 9 deletions(-)