diff mbox

ARM64: kernel oops in 4.4-rc4+

Message ID CACVXFVPgTPYzHj0E7cSx=X+=t=0Am_yBSk6FnTZHeG0hyY6ipg@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ming Lei Dec. 8, 2015, 1:08 p.m. UTC
On Tue, Dec 8, 2015 at 6:30 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Dec 08, 2015 at 02:30:33PM +0800, Ming Lei wrote:
>> Hi,
>
> [adding Catalin to cc]
>
>> The attached kernel oops can be triggered immediately after
>> running the following command on APM Mustang:
>>
>>         $stress-ng --all 8 -t 10m
>>
>> [1] kernel oops log
>> stress-ng: info:  [5220] 5 failures reached, aborting stress process
>> [  265.782659] kernel BUG at ./arch/arm64/include/asm/pgtable.h:282!
>
> Yikes, this means we're replacing a writable pte with a clean pte, so
> there's a potential race w/ hardware DBM.
>
> Could you dump pte and *ptep please?

They are dumped as so:

set_pte_at: addr 470000, ptep fffffe00bc870238, *ptep 680047348a0bd3,
pte 680047348a0fd3

with the following change:

>
>> [  265.788726] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>> [  265.794186] Modules linked in:
>> [  265.797241] CPU: 1 PID: 15830 Comm: stress-ng-cache Tainted: G
>>   W       4.4.0-rc4+ #54
>
> What changes do you have on top of -rc4?

Nothing else, and it is just built from the latest linus tree.


Thanks,
Ming Lei

Comments

Will Deacon Dec. 8, 2015, 1:49 p.m. UTC | #1
On Tue, Dec 08, 2015 at 09:08:32PM +0800, Ming Lei wrote:
> On Tue, Dec 8, 2015 at 6:30 PM, Will Deacon <will.deacon@arm.com> wrote:
> > On Tue, Dec 08, 2015 at 02:30:33PM +0800, Ming Lei wrote:
> >> The attached kernel oops can be triggered immediately after
> >> running the following command on APM Mustang:
> >>
> >>         $stress-ng --all 8 -t 10m
> >>
> >> [1] kernel oops log
> >> stress-ng: info:  [5220] 5 failures reached, aborting stress process
> >> [  265.782659] kernel BUG at ./arch/arm64/include/asm/pgtable.h:282!
> >
> > Yikes, this means we're replacing a writable pte with a clean pte, so
> > there's a potential race w/ hardware DBM.
> >
> > Could you dump pte and *ptep please?
> 
> They are dumped as so:
> 
> set_pte_at: addr 470000, ptep fffffe00bc870238, *ptep 680047348a0bd3,
> pte 680047348a0fd3

Thanks for dumping these.

It looks like we're trying to set the access flag in the pte, so its
got nothing to do with swp entries (although they may well be broken
anyway with these BUG_ONs). With H/W DBM enabled, we shouldn't be doing
software management of the access flag, so the BUG_ON looks like a red
herring in this case.

I'm not sure on the best fix for this, though. We can either make the
BUG_ON dependent on the hardware supporting DBM or we could override
ptep_set_access_flags to avoid the debug check.

Will
Catalin Marinas Dec. 8, 2015, 4:08 p.m. UTC | #2
On Tue, Dec 08, 2015 at 01:49:52PM +0000, Will Deacon wrote:
> On Tue, Dec 08, 2015 at 09:08:32PM +0800, Ming Lei wrote:
> > On Tue, Dec 8, 2015 at 6:30 PM, Will Deacon <will.deacon@arm.com> wrote:
> > > On Tue, Dec 08, 2015 at 02:30:33PM +0800, Ming Lei wrote:
> > >> The attached kernel oops can be triggered immediately after
> > >> running the following command on APM Mustang:
> > >>
> > >>         $stress-ng --all 8 -t 10m
> > >>
> > >> [1] kernel oops log
> > >> stress-ng: info:  [5220] 5 failures reached, aborting stress process
> > >> [  265.782659] kernel BUG at ./arch/arm64/include/asm/pgtable.h:282!
> > >
> > > Yikes, this means we're replacing a writable pte with a clean pte, so
> > > there's a potential race w/ hardware DBM.
> > >
> > > Could you dump pte and *ptep please?
> > 
> > They are dumped as so:
> > 
> > set_pte_at: addr 470000, ptep fffffe00bc870238, *ptep 680047348a0bd3,
> > pte 680047348a0fd3
> 
> Thanks for dumping these.
> 
> It looks like we're trying to set the access flag in the pte, so its
> got nothing to do with swp entries (although they may well be broken
> anyway with these BUG_ONs). With H/W DBM enabled, we shouldn't be doing
> software management of the access flag, so the BUG_ON looks like a red
> herring in this case.

Setting the access flag is fine, clearing it is a potential problem (the
first BUG_ON). If the line numbers match mainline, 282 means the second
BUG_ON in set_pte_at(). The ptep_set_access_flags() is meant to update
the access or dirty flags without corrupting the state. I think the
scenario is:

1. *ptep is writable and clean and old (PTE_WRITE && PTE_RDONLY && !PTE_AF)
2. ptep_set_access_flags() tries to set PTE_AF while keeping the rest
   unchanged
3. BUG_ON triggers because a writable PTE is overridden with a writeable
   && clean PTE, potentially overriding a hardware-updated dirty entry

Point 3 could easily cause us problems if you mix DBM and non-DBM agents
in the same system.

Another scenario when all agents can do DBM is for two CPUs to take a
read fault on a non-present PTE. The first CPU that takes the mm
semaphore updates the PTE from non-present to valid && writable && clean
&& young. The second CPU comes in and finds the PTE present in
handle_pte_fault() and goes on to do a ptep_set_access_flags(). At this
point we have a race with any other agent updating the dirty status.

I think the BUG_ON check is still useful, we just need at atomic
implementation of ptep_set_access_flags(). I'll post something and take
it from there (I need to think some more about how this would interact
with the set_pte_at() checks for the dirty bit).

BTW, I don't think the check for pte_valid(pte) is needed in
set_pte_at() as we should never go from valid pte to file without
break-before-make. With proper b-b-m, pte_valid(*ptep) is false, so no
BUG_ON checks.
diff mbox

Patch

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7e074f9..1c5f0ee 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -278,6 +278,12 @@  static inline void set_pte_at(struct mm_struct *mm, unsigne
         */
        if (IS_ENABLED(CONFIG_DEBUG_VM) && IS_ENABLED(CONFIG_ARM64_HW_AFDBM) &&
            pte_valid(*ptep)) {
+
+               if (!pte_young(pte) || (pte_write(*ptep) && !pte_dirty(pte)))
+                       trace_printk("%s: addr %lx, ptep %p, *ptep %llx, pte %ll
+                               __func__, addr, ptep,
+                               (unsigned long long)pte_val(*ptep),
+                               (unsigned long long )pte_val(pte));
                BUG_ON(!pte_young(pte));
                BUG_ON(pte_write(*ptep) && !pte_dirty(pte));