Message ID: 20210127172706.617195-1-pasha.tatashin@soleen.com (mailing list archive)
Series: arm64: MMU enabled kexec relocation
Hi Pavel,

On 27/01/2021 17:27, Pavel Tatashin wrote:
> Enable MMU during kexec relocation in order to improve reboot performance.
>
> If kexec functionality is used for a fast system update, with a minimal
> downtime, the relocation of kernel + initramfs takes a significant portion
> of reboot.
>
> The reason for slow relocation is that it is done without the MMU, and thus
> does not benefit from the D-cache.
>
> Performance data
> ----------------
> For this experiment, the size of kernel plus initramfs is small, only 25M.
> If initramfs were larger, the improvements would be greater, as time
> spent in relocation is proportional to the size of relocation.
>
> Previously:
> kernel shutdown   0.022131328s
> relocation        0.440510736s
> kernel startup    0.294706768s
>
> Relocation was taking: 58.2% of reboot time
>
> Now:
> kernel shutdown   0.032066576s
> relocation        0.022158152s
> kernel startup    0.296055880s
>
> Now: Relocation takes 6.3% of reboot time
>
> Total reboot is x2.16 times faster.
>
> With bigger userland (fitImage 380M), the reboot time is improved by 3.57s,
> and is reduced from 3.9s down to 0.33s
>
> Previous approaches and discussions
> -----------------------------------

The problem I see with this is rewriting the relocation code. It needs to
work whether the machine has enough memory to enable the MMU during kexec,
or not.

In off-list mail to Pavel I proposed an alternative implementation here:
https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec+mmu/v0

By using a copy of the linear map, and passing the phys_to_virt offset into
arm64_relocate_new_kernel(), it's possible to use the same code when we fail
to allocate the page tables, and run with the MMU off as it does today.
I'm convinced someone will crawl out of the woodwork screaming 'regression'
if we substantially increase the amount of memory needed to kexec at all.

From that discussion: this didn't meet Pavel's timing needs.
If you depend on having all the src/dst pages lined up in a single line, it
sounds like you've over-tuned this to depend on the CPU's streaming mode.
What causes the CPU to start/stop that stuff is very implementation specific
(and firmware configurable).
I don't think we should let this rule out systems that can kexec today, but
don't have enough extra memory for the page tables.
Having two copies of the relocation code is obviously a bad idea.

(as before:) Instead of trying to make the relocations run quickly, can we
reduce them? This would benefit other architectures too.

Can the kexec core code allocate higher order pages, instead of doing
everything one page at a time?

If you have a crash kernel reservation, can we use that to eliminate the
relocations completely?
(I think this suggestion has been lost in translation each time I make it.
I mean like this:
https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec/kexec_in_crashk/v0
Runes to test it:
| sudo ./kexec -p -u
| sudo cat /proc/iomem | grep Crash
| b0200000-f01fffff : Crash kernel
| sudo ./kexec --mem-min=0xb0200000 --mem-max=0xf01ffffff -l ~/Image --reuse-cmdline

I bet it's even faster!)

I think 'as fast as possible' and 'memory constrained' are mutually
exclusive requirements. We need to make the page tables optional with a
single implementation.


Thanks,

James
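A minimal C sketch of the single-routine idea James describes, illustrative
only and not taken from his branch: the real routine is arm64 assembly
(arm64_relocate_new_kernel()), and relocate_new_kernel() below is a
hypothetical stand-in. The IND_* flag values follow include/linux/kexec.h.
The point is that the same walk works with the MMU off (offset = 0, entries
used as physical addresses) and with a copy of the linear map installed
(offset = the phys-to-virt offset).

#include <string.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(PAGE_SIZE - 1))

/* kexec indirection-list flags, as in include/linux/kexec.h */
#define IND_DESTINATION	0x1
#define IND_INDIRECTION	0x2
#define IND_DONE	0x4
#define IND_SOURCE	0x8
#define IND_FLAGS	(IND_DESTINATION | IND_INDIRECTION | IND_DONE | IND_SOURCE)

/*
 * One relocation walk for both configurations.  'head' is expected to be an
 * IND_INDIRECTION entry pointing at the first indirection page, so 'ptr' is
 * set before it is first dereferenced.
 */
static void relocate_new_kernel(unsigned long head, unsigned long offset)
{
	unsigned long *ptr = NULL;
	char *dest = NULL;
	unsigned long entry;

	for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
		void *addr = (void *)((entry & PAGE_MASK) + offset);

		switch (entry & IND_FLAGS) {
		case IND_DESTINATION:
			dest = addr;		/* next copies land here */
			break;
		case IND_INDIRECTION:
			ptr = addr;		/* switch to a new entry page */
			break;
		case IND_SOURCE:
			memcpy(dest, addr, PAGE_SIZE);
			dest += PAGE_SIZE;
			break;
		}
	}
}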
Hi James,

> The problem I see with this is rewriting the relocation code. It needs to
> work whether the machine has enough memory to enable the MMU during kexec,
> or not.
>
> In off-list mail to Pavel I proposed an alternative implementation here:
> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec+mmu/v0
>
> By using a copy of the linear map, and passing the phys_to_virt offset into
> arm64_relocate_new_kernel(), it's possible to use the same code when we fail
> to allocate the page tables, and run with the MMU off as it does today.
> I'm convinced someone will crawl out of the woodwork screaming 'regression'
> if we substantially increase the amount of memory needed to kexec at all.
>
> From that discussion: this didn't meet Pavel's timing needs.
> If you depend on having all the src/dst pages lined up in a single line, it
> sounds like you've over-tuned this to depend on the CPU's streaming mode.
> What causes the CPU to start/stop that stuff is very implementation
> specific (and firmware configurable).
> I don't think we should let this rule out systems that can kexec today, but
> don't have enough extra memory for the page tables.
> Having two copies of the relocation code is obviously a bad idea.

I understand that having an extra set of page tables could potentially
waste memory, especially if VAs are sparse, but in this case we use
page tables exclusively for contiguous VA space (copy [src, src + size]).
Therefore, the extra memory usage is tiny. The ratio for kernels with 4K
page size is (size of relocated memory) / 512. A normal initrd + kernel
is usually under 64M, which means ~128K of extra space for the page
tables. Even with a huge relocation, where initrd is ~512M, the extra
memory usage in the worst case is just ~1M. I really doubt we will have
any problem from users because of such small overhead in comparison to
the total kexec-load size.

> (as before:) Instead of trying to make the relocations run quickly, can we
> reduce them? This would benefit other architectures too.

This was exactly my first approach [1], where I tried to pre-reserve
memory similar to how it is done for a crash kernel, but I was asked
to go away [2] as this is an arm64-specific problem, where current
relocation performance is prohibitively slow. I have tested on x86,
and it does not suffer from this problem; relocation performance is
just as fast as on arm64 with the MMU enabled.

> Can the kexec core code allocate higher order pages, instead of doing
> everything one page at a time?

Yes, however, failures during kexec-load due to failure to coalesce
huge pages can add extra hassle to users, and therefore this should be
only an optimization with fallback to base pages.

> If you have a crash kernel reservation, can we use that to eliminate the
> relocations completely?
> (I think this suggestion has been lost in translation each time I make it.
> I mean like this:
> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec/kexec_in_crashk/v0
> Runes to test it:
> | sudo ./kexec -p -u
> | sudo cat /proc/iomem | grep Crash
> | b0200000-f01fffff : Crash kernel
> | sudo ./kexec --mem-min=0xb0200000 --mem-max=0xf01ffffff -l ~/Image --reuse-cmdline
>
> I bet it's even faster!)

There is a problem with this approach. While with the kexec_load() call it
is possible to specify physical destinations for each segment, with
kexec_file_load() it is not possible. The secure systems that do IMA
checks during kexec load require kexec_file_load(), and we cannot specify
destinations for these segments ahead of time (at least not without
substantially changing common kexec code, which is not going to happen as
this is an arm64-specific problem).

> I think 'as fast as possible' and 'memory constrained' are mutually
> exclusive requirements. We need to make the page tables optional with a
> single implementation.

In my opinion having two different types of relocations will only add
extra corner cases, confusion about different performance, and bugs.
It is better to have two types: 1. crash kernel type without
relocation, 2. fast relocation where the MMU is enabled.

[1] https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com
[2] https://lore.kernel.org/lkml/20190710065953.GA4744@localhost.localdomain/

Thank you,
Pasha
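A rough, illustrative calculation of the page-table overhead Pavel estimates
above. This is a user-space sketch, not kernel code; it assumes a 4K granule
and counts one PTE page per 2M of mapped range, one PMD page per 1G, plus a
couple of upper-level pages, which works out to roughly
(size of relocated memory) / 512.

#include <stdio.h>

#define SZ_4K	(4UL << 10)
#define SZ_2M	(2UL << 20)
#define SZ_1G	(1UL << 30)

/* Approximate bytes of page tables needed to map 'size' contiguous bytes. */
static unsigned long pgtable_overhead(unsigned long size)
{
	unsigned long pte_pages = (size + SZ_2M - 1) / SZ_2M;	/* one per 2M */
	unsigned long pmd_pages = (size + SZ_1G - 1) / SZ_1G;	/* one per 1G */

	return (pte_pages + pmd_pages + 2 /* PUD + PGD */) * SZ_4K;
}

int main(void)
{
	/* prints 140 KiB, i.e. roughly the ~128K figure quoted above */
	printf("64M  of kernel+initrd -> %lu KiB of tables\n",
	       pgtable_overhead(64UL << 20) >> 10);
	/* prints 1036 KiB, i.e. roughly the ~1M worst case quoted above */
	printf("512M of kernel+initrd -> %lu KiB of tables\n",
	       pgtable_overhead(512UL << 20) >> 10);
	return 0;
}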
Pavel Tatashin <pasha.tatashin@soleen.com> writes:

> Hi James,
>
>> The problem I see with this is rewriting the relocation code. It needs to
>> work whether the machine has enough memory to enable the MMU during kexec,
>> or not.
>>
>> In off-list mail to Pavel I proposed an alternative implementation here:
>> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec+mmu/v0
>>
>> By using a copy of the linear map, and passing the phys_to_virt offset into
>> arm64_relocate_new_kernel(), it's possible to use the same code when we fail
>> to allocate the page tables, and run with the MMU off as it does today.
>> I'm convinced someone will crawl out of the woodwork screaming 'regression'
>> if we substantially increase the amount of memory needed to kexec at all.
>>
>> From that discussion: this didn't meet Pavel's timing needs.
>> If you depend on having all the src/dst pages lined up in a single line, it
>> sounds like you've over-tuned this to depend on the CPU's streaming mode.
>> What causes the CPU to start/stop that stuff is very implementation
>> specific (and firmware configurable).
>> I don't think we should let this rule out systems that can kexec today, but
>> don't have enough extra memory for the page tables.
>> Having two copies of the relocation code is obviously a bad idea.
>
> I understand that having an extra set of page tables could potentially
> waste memory, especially if VAs are sparse, but in this case we use
> page tables exclusively for contiguous VA space (copy [src, src + size]).
> Therefore, the extra memory usage is tiny. The ratio for kernels with 4K
> page size is (size of relocated memory) / 512. A normal initrd + kernel
> is usually under 64M, which means ~128K of extra space for the page
> tables. Even with a huge relocation, where initrd is ~512M, the extra
> memory usage in the worst case is just ~1M. I really doubt we will have
> any problem from users because of such small overhead in comparison to
> the total kexec-load size.

Foolish question.

Does arm64 have something like 2M pages that it can use for the linear map?

On x86_64 we always generate page tables, because they are necessary to be
in 64bit mode. As I recall, on x86_64 we always use 2M pages, which means
for each 4K of page tables we map 1GiB of memory. Which is very tiny.

If you do as well as x86_64 for arm64, I suspect that will be good enough
for people to not claim regression.

Would a variation on the x86_64 implementation that allocates page tables
work for arm64?

>> (as before:) Instead of trying to make the relocations run quickly, can we
>> reduce them? This would benefit other architectures too.
>
> This was exactly my first approach [1], where I tried to pre-reserve
> memory similar to how it is done for a crash kernel, but I was asked
> to go away [2] as this is an arm64-specific problem, where current
> relocation performance is prohibitively slow. I have tested on x86,
> and it does not suffer from this problem; relocation performance is
> just as fast as on arm64 with the MMU enabled.
>
>> Can the kexec core code allocate higher order pages, instead of doing
>> everything one page at a time?
>
> Yes, however, failures during kexec-load due to failure to coalesce
> huge pages can add extra hassle to users, and therefore this should be
> only an optimization with fallback to base pages.
>
>> If you have a crash kernel reservation, can we use that to eliminate the
>> relocations completely?
>> (I think this suggestion has been lost in translation each time I make it.
>> I mean like this:
>> https://gitlab.arm.com/linux-arm/linux-jm/-/tree/kexec/kexec_in_crashk/v0
>> Runes to test it:
>> | sudo ./kexec -p -u
>> | sudo cat /proc/iomem | grep Crash
>> | b0200000-f01fffff : Crash kernel
>> | sudo ./kexec --mem-min=0xb0200000 --mem-max=0xf01ffffff -l ~/Image --reuse-cmdline
>>
>> I bet it's even faster!)
>
> There is a problem with this approach. While with the kexec_load() call it
> is possible to specify physical destinations for each segment, with
> kexec_file_load() it is not possible. The secure systems that do IMA
> checks during kexec load require kexec_file_load(), and we cannot specify
> destinations for these segments ahead of time (at least not without
> substantially changing common kexec code, which is not going to happen as
> this is an arm64-specific problem).
>
>> I think 'as fast as possible' and 'memory constrained' are mutually
>> exclusive requirements. We need to make the page tables optional with a
>> single implementation.

In my experience the slowdown from disabling a CPU's cache (which apparently
happens on arm64 when the MMU is disabled) is freakishly huge. Enabling the
cache shouldn't be 'as fast as possible' but simply disengaging the parking
brake.

> In my opinion having two different types of relocations will only add
> extra corner cases, confusion about different performance, and bugs.
> It is better to have two types: 1. crash kernel type without
> relocation, 2. fast relocation where the MMU is enabled.
>
> [1] https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com
> [2] https://lore.kernel.org/lkml/20190710065953.GA4744@localhost.localdomain/

As long as the page table provided is a linear mapping of physical memory
(aka it looks like paging is disabled), the code that relocates memory
should be pretty much the same.

My experience with other architectures suggests only a couple of
instructions need to be different to deal with a MMU being enabled.

Eric
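Eric's 1GiB-per-4K figure follows directly from the geometry of a 4K table
full of 2M block entries; a tiny sketch of the arithmetic (illustrative
only, not kernel code):

#include <stdio.h>

/*
 * One 4K page table holds 512 eight-byte entries. If each entry is a 2M
 * block mapping, a single 4K table covers 512 * 2M = 1GiB, so the page-table
 * cost of a 2M-granular linear map is negligible.
 */
int main(void)
{
	unsigned long entries_per_table = 4096 / 8;	/* 512 */
	unsigned long block_size = 2UL << 20;		/* 2M */
	unsigned long covered = entries_per_table * block_size;

	printf("one 4K table of 2M blocks maps %lu GiB\n", covered >> 30);
	return 0;
}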
> > I understand that having an extra set of page tables could potentially
> > waste memory, especially if VAs are sparse, but in this case we use
> > page tables exclusively for contiguous VA space (copy [src, src + size]).
> > Therefore, the extra memory usage is tiny. The ratio for kernels with 4K
> > page size is (size of relocated memory) / 512. A normal initrd + kernel
> > is usually under 64M, which means ~128K of extra space for the page
> > tables. Even with a huge relocation, where initrd is ~512M, the extra
> > memory usage in the worst case is just ~1M. I really doubt we will have
> > any problem from users because of such small overhead in comparison to
> > the total kexec-load size.

Hi Eric,

Thank you for your e-mail, you gave some interesting insights.

> Foolish question.
>
> Does arm64 have something like 2M pages that it can use for the
> linear map?

Yes, with 4K pages arm64 also has 2M pages, but arm64 also has a choice of
16K and 64K granules, where the second-level blocks are bigger.

> On x86_64 we always generate page tables, because they are necessary to be
> in 64bit mode. As I recall, on x86_64 we always use 2M pages, which means
> for each 4K of page tables we map 1GiB of memory. Which is very tiny.
>
> If you do as well as x86_64 for arm64, I suspect that will be good enough
> for people to not claim regression.
>
> Would a variation on the x86_64 implementation that allocates page tables
> work for arm64?
...
> As long as the page table provided is a linear mapping of physical memory
> (aka it looks like paging is disabled), the code that relocates memory
> should be pretty much the same.
>
> My experience with other architectures suggests only a couple of
> instructions need to be different to deal with a MMU being enabled.

I think what you are proposing is similar to what James proposed. Yes, with
a linear map, relocation should be pretty much the same as when we do
relocation with the MMU disabled.

A linear map still uses memory, because the page tables must be outside of
the destination addresses of the next kernel's segments. Therefore, we must
allocate a page table for the linear map. It might be a little smaller, but
in reality the difference is small with 4K pages, and insignificant with 64K
pages. The benefit of my approach is that the assembly copy loop is simpler,
and allows hardware prefetching to work.

The regular relocation loop works like this:

for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
        addr = __va(entry & PAGE_MASK);

        switch (entry & IND_FLAGS) {
        case IND_DESTINATION:
                dest = addr;
                break;
        case IND_INDIRECTION:
                ptr = addr;
                break;
        case IND_SOURCE:
                copy_page(dest, addr);
                dest += PAGE_SIZE;
        }
}

The entry for the next relocation page always has to be fetched, and
therefore prefetching cannot help with the actual loop.

In comparison, the loop that I am proposing is like this:

for (addr = head; addr < end; addr += PAGE_SIZE, dest += PAGE_SIZE)
        copy_page(dest, addr);

Here is the assembly code for my loop:

1:      copy_page x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
        sub     x11, x11, #PAGE_SIZE
        cbnz    x11, 1b

That said, if James and you agree that the linear map is the way to go
forward, I am OK with that as well, as it is still much better than having
no caching at all.

Thank you,
Pasha
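The prefetching argument above can be seen in a stripped-down user-space
sketch (illustrative only; copy_chased() and copy_sequential() are
hypothetical names, not kernel functions). In the first loop the address of
the next source page is only known after a load from the entry list, while
in the second it is computed arithmetically, so a hardware prefetcher can
run ahead of the copy.

#include <string.h>

#define PAGE_SIZE 4096UL

/* Pointer-chased walk: each iteration must load the list entry before it
 * knows where the next source page is (a data-dependent address). */
static char *copy_chased(char *dst, const unsigned long *entries)
{
	for (; *entries; entries++, dst += PAGE_SIZE)
		memcpy(dst, (const void *)*entries, PAGE_SIZE);
	return dst;
}

/* Sequential walk: the next source address is src + PAGE_SIZE, known without
 * any load, so prefetching and streaming can be effective. */
static char *copy_sequential(char *dst, const char *src, unsigned long size)
{
	unsigned long off;

	for (off = 0; off < size; off += PAGE_SIZE)
		memcpy(dst + off, src + off, PAGE_SIZE);
	return dst + size;
}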
Pavel Tatashin <pasha.tatashin@soleen.com> writes:

>> > I understand that having an extra set of page tables could potentially
>> > waste memory, especially if VAs are sparse, but in this case we use
>> > page tables exclusively for contiguous VA space (copy [src, src + size]).
>> > Therefore, the extra memory usage is tiny. The ratio for kernels with 4K
>> > page size is (size of relocated memory) / 512. A normal initrd + kernel
>> > is usually under 64M, which means ~128K of extra space for the page
>> > tables. Even with a huge relocation, where initrd is ~512M, the extra
>> > memory usage in the worst case is just ~1M. I really doubt we will have
>> > any problem from users because of such small overhead in comparison to
>> > the total kexec-load size.
>
> Hi Eric,
>
> Thank you for your e-mail, you gave some interesting insights.
>
>> Foolish question.
>>
>> Does arm64 have something like 2M pages that it can use for the
>> linear map?
>
> Yes, with 4K pages arm64 also has 2M pages, but arm64 also has a choice of
> 16K and 64K granules, where the second-level blocks are bigger.
>
>> On x86_64 we always generate page tables, because they are necessary to be
>> in 64bit mode. As I recall, on x86_64 we always use 2M pages, which means
>> for each 4K of page tables we map 1GiB of memory. Which is very tiny.
>>
>> If you do as well as x86_64 for arm64, I suspect that will be good enough
>> for people to not claim regression.
>>
>> Would a variation on the x86_64 implementation that allocates page tables
>> work for arm64?
> ...
>> As long as the page table provided is a linear mapping of physical memory
>> (aka it looks like paging is disabled), the code that relocates memory
>> should be pretty much the same.
>>
>> My experience with other architectures suggests only a couple of
>> instructions need to be different to deal with a MMU being enabled.
>
> I think what you are proposing is similar to what James proposed. Yes, with
> a linear map, relocation should be pretty much the same as when we do
> relocation with the MMU disabled.
>
> A linear map still uses memory, because the page tables must be outside of
> the destination addresses of the next kernel's segments. Therefore, we must
> allocate a page table for the linear map. It might be a little smaller, but
> in reality the difference is small with 4K pages, and insignificant with
> 64K pages. The benefit of my approach is that the assembly copy loop is
> simpler, and allows hardware prefetching to work.
>
> The regular relocation loop works like this:
>
> for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
>         addr = __va(entry & PAGE_MASK);
>
>         switch (entry & IND_FLAGS) {
>         case IND_DESTINATION:
>                 dest = addr;
>                 break;
>         case IND_INDIRECTION:
>                 ptr = addr;
>                 break;
>         case IND_SOURCE:
>                 copy_page(dest, addr);
>                 dest += PAGE_SIZE;
>         }
> }
>
> The entry for the next relocation page always has to be fetched, and
> therefore prefetching cannot help with the actual loop.

True. In the common case the loop looks like:

> for (entry = head; !(entry & IND_DONE); entry = *ptr++) {
>         addr = __va(entry & PAGE_MASK);
>
>         switch (entry & IND_FLAGS) {
>         case IND_SOURCE:
>                 copy_page(dest, addr);
>                 dest += PAGE_SIZE;
>         }
> }

Which is a read of the source address followed by the copy_page. I suspect
the overhead of that loop is small enough that it is swamped by the cost of
the copy_page. If not, and a better data structure can be proposed, we can
look at that.

> In comparison, the loop that I am proposing is like this:
>
> for (addr = head; addr < end; addr += PAGE_SIZE, dest += PAGE_SIZE)
>         copy_page(dest, addr);
>
> Here is the assembly code for my loop:
>
> 1:      copy_page x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
>         sub     x11, x11, #PAGE_SIZE
>         cbnz    x11, 1b

I think you may be hiding the cost of that loop in the page table fetches
themselves. It is possible, though unlikely, that a page table with huge
pages (and thus smaller page fault costs) and the original loop is actually
cheaper.

> That said, if James and you agree that the linear map is the way to go
> forward, I am OK with that as well, as it is still much better than having
> no caching at all.

The big advantage of a linear map is that the kexec'd code can continue to
use it until it sets up its own page tables. I probably did not document it
well enough, but a linear map (the equivalent of not having virtual
addresses at all) was always my intention for the hand-off state of kexec
between kernels.

So please try the linear map. If it is noticeably slower than your optimized
page table, give numbers and we can see if there is a way to improve the
generic kexec data structures.

Eric