
A few other cache-related optimizations for Cortex-A9.

Message ID CAN_5kQBszi=hV1RVjyKO6gOhOuymGjsMwLk6ORaWpkaL-4USxA@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Heechul Yun July 6, 2011, 3:14 a.m. UTC
I found a few other places where, I believe, cache maintenance is not necessary on Cortex-A9.


Creating page tables also does not need to clean the cache lines, for
the same reason as above.
This patch improves lmbench3 (fork/exec/shell) performance by 10%-20%
in my tests.

I think the above two patches work at least for Cortex-A9, although I am
not sure whether the use of CONFIG_CPU_CACHE_V7 is appropriate.

Thanks

Comments

Catalin Marinas July 6, 2011, 8:56 a.m. UTC | #1
On Wed, Jul 06, 2011 at 04:14:57AM +0100, heechul Yun wrote:
> I found a few other places where, I believe, cache maintenance is not necessary on Cortex-A9.
> 
> diff --git a/arch/arm/mm/copypage-v6.c b/arch/arm/mm/copypage-v6.c
> index bdba6c6..6d5a847 100644
> --- a/arch/arm/mm/copypage-v6.c
> +++ b/arch/arm/mm/copypage-v6.c
> @@ -41,7 +41,9 @@ static void v6_copy_user_highpage_nonaliasing(struct page *to,
>         kfrom = kmap_atomic(from, KM_USER0);
>         kto = kmap_atomic(to, KM_USER1);
>         copy_page(kto, kfrom);
> +#ifndef CONFIG_CPU_CACHE_V7
>         __cpuc_flush_dcache_area(kto, PAGE_SIZE);
> +#endif
>         kunmap_atomic(kto, KM_USER1);
>         kunmap_atomic(kfrom, KM_USER0);
>  }
> 
> On handling a COW page fault, the above function is called to copy the
> page contents of the parent to a newly allocated page frame for the
> child. Again, since the D-cache of the A9 is PIPT, we do not need to
> flush the page, just as on x86. This modification improves lmbench
> (fork/exec/shell) performance by 4-6%.

See commit 115b2247 introducing this. We indeed have a PIPT-like cache
on A9, but it is a Harvard architecture with separate I and D caches. It
happened in the past that we got a COW for a text page and the I and D
caches became incoherent. Since then, the dynamic linker has been fixed
and no longer causes this. We could add a check for VM_EXEC in
vma->vm_flags.

But I wonder whether we still need this flush after commit c0177800
where we assume that a new page cache page has dirty D-cache (and we
later flush the caches via set_pte_at).

> I think the above two patches work at least for Cortex-A9, although I am
> not sure whether the use of CONFIG_CPU_CACHE_V7 is appropriate.

We need to check the ID_MMFR1 register as there are other ARMv7 cores
that cannot do page table walks in the L1 cache.
Russell King - ARM Linux July 6, 2011, 7:30 p.m. UTC | #2
On Wed, Jul 06, 2011 at 09:56:56AM +0100, Catalin Marinas wrote:
> On Wed, Jul 06, 2011 at 04:14:57AM +0100, heechul Yun wrote:
> > I found a few other places where, I believe, cache maintenance is not necessary on Cortex-A9.
> > 
> > diff --git a/arch/arm/mm/copypage-v6.c b/arch/arm/mm/copypage-v6.c
> > index bdba6c6..6d5a847 100644
> > --- a/arch/arm/mm/copypage-v6.c
> > +++ b/arch/arm/mm/copypage-v6.c
> > @@ -41,7 +41,9 @@ static void v6_copy_user_highpage_nonaliasing(struct page *to,
> >         kfrom = kmap_atomic(from, KM_USER0);
> >         kto = kmap_atomic(to, KM_USER1);
> >         copy_page(kto, kfrom);
> > +#ifndef CONFIG_CPU_CACHE_V7
> >         __cpuc_flush_dcache_area(kto, PAGE_SIZE);
> > +#endif
> >         kunmap_atomic(kto, KM_USER1);
> >         kunmap_atomic(kfrom, KM_USER0);
> >  }
> > 
> > On handling a COW page fault, the above function is called to copy the
> > page contents of the parent to a newly allocated page frame for the
> > child. Again, since the D-cache of the A9 is PIPT, we do not need to
> > flush the page, just as on x86. This modification improves lmbench
> > (fork/exec/shell) performance by 4-6%.
> 
> See commit 115b2247 introducing this. We indeed have a PIPT-like cache
> on A9, but it is a Harvard architecture with separate I and D caches. It
> happened in the past that we got a COW for a text page and the I and D
> caches became incoherent. Since then, the dynamic linker has been fixed
> and no longer causes this. We could add a check for VM_EXEC in
> vma->vm_flags.
> 
> But I wonder whether we still need this flush after commit c0177800
> where we assume that a new page cache page has dirty D-cache (and we
> later flush the caches via set_pte_at).

I don't think we need that flush there after c0177800 either.  I/D
coherency implies that pte_exec() is set, which will get us through
to the checking of PG_arch_1 in __sync_icache_dcache(), where we'll
call __flush_dcache_page for this page.

We don't need this flush anymore, so let's simply kill it outright.

Heechul (sorry, is that the correct way of addressing you?), could
you please submit a patch removing the __cpuc_flush_dcache_area()
call from v6_copy_user_highpage_nonaliasing() entirely?

Thanks.
Heechul Yun July 7, 2011, 2:53 a.m. UTC | #3
>
> We don't need this flush anymore, so let's simply kill it outright.
>
> Heechul (sorry, is that the correct way of addressing you?) could
> you please submit a patch removing the __cpuc_flush_dcache_area()
> from v6_copy_user_highpage_nonaliasing() entirely please?
>

I have sent the patch.

Thanks

Heechul

Patch

diff --git a/arch/arm/mm/copypage-v6.c b/arch/arm/mm/copypage-v6.c
index bdba6c6..6d5a847 100644
--- a/arch/arm/mm/copypage-v6.c
+++ b/arch/arm/mm/copypage-v6.c
@@ -41,7 +41,9 @@ static void v6_copy_user_highpage_nonaliasing(struct page *to,
        kfrom = kmap_atomic(from, KM_USER0);
        kto = kmap_atomic(to, KM_USER1);
        copy_page(kto, kfrom);
+#ifndef CONFIG_CPU_CACHE_V7
        __cpuc_flush_dcache_area(kto, PAGE_SIZE);
+#endif
        kunmap_atomic(kto, KM_USER1);
        kunmap_atomic(kfrom, KM_USER0);
 }

On handling a COW page fault, the above function is called to copy the
page contents of the parent to a newly allocated page frame for the
child. Again, since the D-cache of the A9 is PIPT, we do not need to
flush the page, just as on x86. This modification improves lmbench
(fork/exec/shell) performance by 4-6%.

diff --git a/arch/arm/include/asm/pgalloc.h b/arch/arm/include/asm/pgalloc.h
index b12cc98..bff9858 100644
--- a/arch/arm/include/asm/pgalloc.h
+++ b/arch/arm/include/asm/pgalloc.h
@@ -61,7 +61,9 @@ pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)

        pte = (pte_t *)__get_free_page(PGALLOC_GFP);
        if (pte) {
+#ifndef CONFIG_CPU_CACHE_V7
                clean_dcache_area(pte, sizeof(pte_t) * PTRS_PER_PTE);
+#endif
                pte += PTRS_PER_PTE;
        }

@@ -81,7 +83,9 @@ pte_alloc_one(struct mm_struct *mm, unsigned long addr)
        if (pte) {
                if (!PageHighMem(pte)) {
                        void *page = page_address(pte);
+#ifndef CONFIG_CPU_CACHE_V7
                        clean_dcache_area(page, sizeof(pte_t) * PTRS_PER_PTE);
+#endif
                }
                pgtable_page_ctor(pte);
        }
diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c
index be5f58e..343df1b 100644
--- a/arch/arm/mm/pgd.c
+++ b/arch/arm/mm/pgd.c
@@ -41,8 +41,9 @@ pgd_t *get_pgd_slow(struct mm_struct *mm)
        memcpy(new_pgd + FIRST_KERNEL_PGD_NR, init_pgd + FIRST_KERNEL_PGD_NR,
                       (PTRS_PER_PGD - FIRST_KERNEL_PGD_NR) * sizeof(pgd_t));

+#ifndef CONFIG_CPU_CACHE_V7
        clean_dcache_area(new_pgd, PTRS_PER_PGD * sizeof(pgd_t));
-
+#endif
        if (!vectors_high()) {
                /*
                 * On ARM, first page must always be allocated since it