Message ID: 20240412131908.433043-1-ryan.roberts@arm.com
Series: Speed up boot with faster linear map creation
On Fri, Apr 12, 2024 at 05:06:41PM +0100, Will Deacon wrote:
> On Fri, 12 Apr 2024 14:19:05 +0100, Ryan Roberts wrote:
> > It turns out that creating the linear map can take a significant proportion of
> > the total boot time, especially when rodata=full. And most of the time is spent
> > waiting on superfluous tlb invalidation and memory barriers. This series reworks
> > the kernel pgtable generation code to significantly reduce the number of those
> > TLBIs, ISBs and DSBs. See each patch for details.
> >
> > The below shows the execution time of map_mem() across a couple of different
> > systems with different RAM configurations. We measure after applying each patch
> > and show the improvement relative to base (v6.9-rc2):
> >
> > [...]
>
> Applied to arm64 (for-next/mm), thanks!
>
> [1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>       https://git.kernel.org/arm64/c/5c63db59c5f8
> [2/3] arm64: mm: Batch dsb and isb when populating pgtables
>       https://git.kernel.org/arm64/c/1fcb7cea8a5f
> [3/3] arm64: mm: Don't remap pgtables for allocate vs populate
>       https://git.kernel.org/arm64/c/0e9df1c905d8

I confirm this series boots the system on FVP (with my .config and my buildroot rootfs using Shrinkwrap).

Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>

Thanks,
Itaru.

> Cheers,
> --
> Will
>
> https://fixes.arm64.dev
> https://next.arm64.dev
> https://will.arm64.dev
On Fri, Apr 12, 2024 at 02:19:05PM +0100, Ryan Roberts wrote:
> Hi All,
>
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And most of the time is spent
> waiting on superfluous tlb invalidation and memory barriers. This series reworks
> the kernel pgtable generation code to significantly reduce the number of those
> TLBIs, ISBs and DSBs. See each patch for details.
>
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc2):
>
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                | ms    (%)   | ms    (%)   | ms    (%)   | ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> base           |  168   (0%) | 2198   (0%) | 8644   (0%) | 17447   (0%)
> no-cont-remap  |   78 (-53%) |  435 (-80%) | 1723 (-80%) |  3779 (-78%)
> batch-barriers |   11 (-93%) |  161 (-93%) |  656 (-92%) |  1654 (-91%)
> no-alloc-remap |   10 (-94%) |  104 (-95%) |  438 (-95%) |  1223 (-93%)
>
> This series applies on top of v6.9-rc2. All mm selftests pass. I've compile and
> boot tested various PAGE_SIZE and VA size configs.

Nice!

> Ryan Roberts (3):
>   arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>   arm64: mm: Batch dsb and isb when populating pgtables
>   arm64: mm: Don't remap pgtables for allocate vs populate

For the series:

Reviewed-by: Mark Rutland <mark.rutland@arm.com>

Catalin, Will, are you happy to pick this up?

Mark.
On Fri, 12 Apr 2024 at 15:19, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And most of the time is spent
> waiting on superfluous tlb invalidation and memory barriers. This series reworks
> the kernel pgtable generation code to significantly reduce the number of those
> TLBIs, ISBs and DSBs. See each patch for details.
>
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc2):
>
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                | ms    (%)   | ms    (%)   | ms    (%)   | ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> base           |  168   (0%) | 2198   (0%) | 8644   (0%) | 17447   (0%)
> no-cont-remap  |   78 (-53%) |  435 (-80%) | 1723 (-80%) |  3779 (-78%)
> batch-barriers |   11 (-93%) |  161 (-93%) |  656 (-92%) |  1654 (-91%)
> no-alloc-remap |   10 (-94%) |  104 (-95%) |  438 (-95%) |  1223 (-93%)
>
> This series applies on top of v6.9-rc2. All mm selftests pass. I've compile and
> boot tested various PAGE_SIZE and VA size configs.

...

> Ryan Roberts (3):
>   arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>   arm64: mm: Batch dsb and isb when populating pgtables
>   arm64: mm: Don't remap pgtables for allocate vs populate

For the series,

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
On Fri, 12 Apr 2024 14:19:05 +0100, Ryan Roberts wrote:
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And most of the time is spent
> waiting on superfluous tlb invalidation and memory barriers. This series reworks
> the kernel pgtable generation code to significantly reduce the number of those
> TLBIs, ISBs and DSBs. See each patch for details.
>
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc2):
>
> [...]

Applied to arm64 (for-next/mm), thanks!

[1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
      https://git.kernel.org/arm64/c/5c63db59c5f8
[2/3] arm64: mm: Batch dsb and isb when populating pgtables
      https://git.kernel.org/arm64/c/1fcb7cea8a5f
[3/3] arm64: mm: Don't remap pgtables for allocate vs populate
      https://git.kernel.org/arm64/c/0e9df1c905d8

Cheers,