Message ID: 20230825093528.1637-1-shameerali.kolothum.thodi@huawei.com
Series: KVM: arm64: Implement SW/HW combined dirty log
Hi Shameer,

On Fri, Aug 25, 2023 at 10:35:20AM +0100, Shameer Kolothum wrote:
> Hi,
>
> This is to revive the RFC series[1], which makes use of the hardware dirty
> bit modifier (DBM) feature (FEAT_HAFDBS) for dirty page tracking, sent
> out by Zhu Keqian some time back.
>
> One of the main drawbacks of using the hardware DBM feature for dirty
> page tracking is the additional overhead of scanning the PTEs for dirty
> pages[2]. Also, there are no vCPU page faults when we set the DBM bit,
> which may result in higher convergence time during guest migration.
>
> This series tries to reduce these overheads by not setting the DBM bit
> for all the writeable pages during migration and instead uses a combined
> software (current page fault mechanism) and hardware (set DBM) approach
> for dirty page tracking.
>
> As noted in RFC v1[1],
> "The core idea is that we do not enable hardware dirty at start (do not
> add DBM bit). When an arbitrary PT occurs fault, we execute soft tracking
> for this PT and enable hardware tracking for its *nearby* PTs (e.g. Add
> DBM bit for nearby 64PTs). Then when sync dirty log, we have known all
> PTs with hardware dirty enabled, so we do not need to scan all PTs."

I'm unconvinced of the value of such a change.

What you're proposing here is complicated and I fear not easily
maintainable. Keeping the *two* sources of dirty state seems likely to
fail (eventually) with some very unfortunate consequences.

The optimization of enabling DBM on neighboring PTEs is presumptive of
the guest access pattern and could incur unnecessary scans of the
stage-2 page table w/ a sufficiently sparse guest access pattern.

> Tests with dirty_log_perf_test with anonymous THP pages show significant
> improvement in "dirty memory time" as expected, but with a hit on
> "get dirty log time".
>
> ./dirty_log_perf_test -b 512MB -v 96 -i 5 -m 2 -s anonymous_thp
>
> +---------------------------+----------------+------------------+
> |                           | 6.5-rc5        | 6.5-rc5 + series |
> |                           | (s)            | (s)              |
> +---------------------------+----------------+------------------+
> | dirty memory time         | 4.22           | 0.41             |
> | get dirty log time        | 0.00047        | 3.25             |
> | clear dirty log time      | 0.48           | 0.98             |
> +---------------------------------------------------------------+

The vCPU:memory ratio you're testing doesn't seem representative of what
a typical cloud provider would be configuring, and the dirty log
collection is going to scale linearly with the size of guest memory.

Slow dirty log collection is going to matter a lot for VM blackout,
which from experience tends to be the most sensitive period of live
migration for guest workloads.

At least in our testing, the split GET/CLEAR dirty log ioctls
dramatically improved the performance of a write-protection based dirty
tracking scheme, as the false positive rate for dirtied pages is
significantly reduced. FWIW, this is what we use for doing LM on arm64 as
opposed to the D-bit implementation that we use on x86.

> In order to get some idea of actual live migration performance,
> I created a VM (96 vCPUs, 1GB), ran a redis-benchmark test and,
> while the test was in progress, initiated a (local) live migration.
>
> redis-benchmark -t set -c 900 -n 5000000 --threads 96
>
> Average of 5 runs shows that the benchmark finishes ~10% faster with
> a ~8% increase in "total time" for migration.
>
> +---------------------------+----------------+------------------+
> |                           | 6.5-rc5        | 6.5-rc5 + series |
> |                           | (s)            | (s)              |
> +---------------------------+----------------+------------------+
> | [redis]5000000 requests in| 79.428         | 71.49            |
> | [info migrate]total time  | 8438           | 9097             |
> +---------------------------------------------------------------+

Faster pre-copy performance would help the benchmark complete faster,
but the goal for a live migration should be to minimize the lost
computation for the entire operation. You'd need to test with a
continuous workload rather than one with a finite amount of work.

Also, do you know what live migration scheme you're using here?
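To make the scheme under discussion concrete: the sketch below is purely
illustrative and is not code from this series. It follows the cover
letter's description -- on a stage-2 write fault the faulting page is
logged in software as today, and the DBM bit is then set on a small window
of neighbouring PTEs so that later writes to them are recorded by hardware
without faulting. DBM_GROUP_PTES and stage2_set_pte_dbm() are hypothetical
names; only mark_page_dirty() is an existing KVM helper.

/* Illustrative sketch only -- not taken from the series. */
#include <linux/align.h>
#include <linux/kvm_host.h>

#define DBM_GROUP_PTES	64	/* "nearby 64 PTs" from the cover letter */

static void handle_write_fault_sw_hw(struct kvm *kvm, gfn_t gfn)
{
	gfn_t start = ALIGN_DOWN(gfn, DBM_GROUP_PTES);
	gfn_t i;

	/* Software path: log the faulting page exactly as today. */
	mark_page_dirty(kvm, gfn);

	/*
	 * Hardware path: set the stage-2 DBM bit on the neighbouring PTEs
	 * so subsequent writes to them dirty the pages without faulting;
	 * only these groups need scanning when the dirty log is synced.
	 */
	for (i = start; i < start + DBM_GROUP_PTES; i++)
		stage2_set_pte_dbm(kvm, i);	/* hypothetical helper */
}

This is also where the concern about sparse access patterns applies: if the
guest rarely touches the neighbours, the group granularity buys little and
the sync-time scan of DBM-enabled groups becomes pure overhead.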
Hi Oliver,

> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton@linux.dev]
> Sent: 13 September 2023 18:30
> Subject: Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined
> dirty log

[...]

> I'm unconvinced of the value of such a change.
>
> What you're proposing here is complicated and I fear not easily
> maintainable. Keeping the *two* sources of dirty state seems likely to
> fail (eventually) with some very unfortunate consequences.

It does add complexity to the dirty state management code. I have tried
to separate the code paths using appropriate flags etc. to make it more
manageable. But this is probably one area we can work on if the overall
approach does have some benefits.

> The optimization of enabling DBM on neighboring PTEs is presumptive of
> the guest access pattern and could incur unnecessary scans of the
> stage-2 page table w/ a sufficiently sparse guest access pattern.

Agreed. This may not work as intended for all workloads, especially if
the access pattern is sparse. But I am still hopeful that it will be
beneficial for workloads that have continuous write patterns. And we do
have a knob to turn it on or off.

[...]

> The vCPU:memory ratio you're testing doesn't seem representative of what
> a typical cloud provider would be configuring, and the dirty log
> collection is going to scale linearly with the size of guest memory.

I was limited by the test setup I had. I will give it a go with a
higher-memory system.

> Slow dirty log collection is going to matter a lot for VM blackout,
> which from experience tends to be the most sensitive period of live
> migration for guest workloads.
>
> At least in our testing, the split GET/CLEAR dirty log ioctls
> dramatically improved the performance of a write-protection based dirty
> tracking scheme, as the false positive rate for dirtied pages is
> significantly reduced. FWIW, this is what we use for doing LM on arm64 as
> opposed to the D-bit implementation that we use on x86.

I guess by D-bit on x86 you mean the PML feature. Unfortunately that is
something we lack on ARM yet.

[...]

> Faster pre-copy performance would help the benchmark complete faster,
> but the goal for a live migration should be to minimize the lost
> computation for the entire operation. You'd need to test with a
> continuous workload rather than one with a finite amount of work.

Ok. Though the above is not representative of a real workload, I thought
it gives some idea of how the "guest uptime" improvement benefits the
overall availability of the workload during migration. I will check within
our wider team to see if I can set up a more suitable test/workload to
show some improvement with this approach.

Please let me know if there is a specific workload you have in mind.

> Also, do you know what live migration scheme you're using here?

The above is the default one (pre-copy).

Thanks for getting back on this. I'd appreciate it if you could take a
quick glance through the rest of the patches as well for any gross
errors, especially with respect to page table walk locking, usage of the
DBM flags, etc.

Thanks,
Shameer
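For context on the split GET/CLEAR interface mentioned above, here is a
minimal userspace sketch of the manual-protect flow, assuming a single
memslot; SLOT_ID and SLOT_PAGES are made-up placeholders and this is not
code from the series. Harvesting the bitmap (GET) is decoupled from
re-arming write protection (CLEAR), so pages dirtied between collection
and transmission are reported in the next round instead of showing up as
false positives.

/* Minimal sketch, assuming one memslot; not from this series. */
#include <linux/kvm.h>
#include <sys/ioctl.h>

#define SLOT_ID		0		/* hypothetical memslot id */
#define SLOT_PAGES	(1UL << 18)	/* hypothetical slot size in pages */

static int enable_manual_protect(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2,
		.args = { KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE },
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

static int collect_dirty(int vm_fd, unsigned long *bitmap)
{
	struct kvm_dirty_log get = {
		.slot = SLOT_ID,
		.dirty_bitmap = bitmap,
	};
	struct kvm_clear_dirty_log clear = {
		.slot = SLOT_ID,
		.first_page = 0,
		.num_pages = SLOT_PAGES,
		.dirty_bitmap = bitmap,
	};

	/* Snapshot the dirty state; with manual protect enabled,
	 * GET does not write-protect or clear anything by itself. */
	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &get))
		return -1;

	/* Re-arm tracking only for the pages just harvested, right before
	 * they are sent; later writes land in the next round's bitmap. */
	return ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}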
On Thu, Sep 14, 2023 at 09:47:48AM +0000, Shameerali Kolothum Thodi wrote:

[...]

> > What you're proposing here is complicated and I fear not easily
> > maintainable. Keeping the *two* sources of dirty state seems likely to
> > fail (eventually) with some very unfortunate consequences.
>
> It does add complexity to the dirty state management code. I have tried
> to separate the code paths using appropriate flags etc. to make it more
> manageable. But this is probably one area we can work on if the overall
> approach does have some benefits.

I'd be a bit more amenable to a solution that would select either
write-protection or dirty state management, but not both.

> > The vCPU:memory ratio you're testing doesn't seem representative of
> > what a typical cloud provider would be configuring, and the dirty log
> > collection is going to scale linearly with the size of guest memory.
>
> I was limited by the test setup I had. I will give it a go with a
> higher-memory system.

Thanks. Dirty log collection needn't be single threaded, but the
fundamental concern of dirty log collection time scaling linearly w.r.t.
the size of memory remains. Write-protection helps spread the cost of
collecting dirty state out across all the vCPU threads.

There could be some value in giving userspace the ability to parallelize
calls to dirty log ioctls to work on non-intersecting intervals.

> > Slow dirty log collection is going to matter a lot for VM blackout,
> > which from experience tends to be the most sensitive period of live
> > migration for guest workloads.
> >
> > At least in our testing, the split GET/CLEAR dirty log ioctls
> > dramatically improved the performance of a write-protection based
> > dirty tracking scheme, as the false positive rate for dirtied pages is
> > significantly reduced. FWIW, this is what we use for doing LM on arm64
> > as opposed to the D-bit implementation that we use on x86.
>
> I guess by D-bit on x86 you mean the PML feature. Unfortunately that is
> something we lack on ARM yet.

Sorry, this was rather nonspecific. I was describing the pre-copy
strategies we're using at Google (out of tree). We're carrying patches
to use EPT D-bit for exitless dirty tracking.

> > Faster pre-copy performance would help the benchmark complete faster,
> > but the goal for a live migration should be to minimize the lost
> > computation for the entire operation. You'd need to test with a
> > continuous workload rather than one with a finite amount of work.
>
> Ok. Though the above is not representative of a real workload, I thought
> it gives some idea of how the "guest uptime" improvement benefits the
> overall availability of the workload during migration. I will check
> within our wider team to see if I can set up a more suitable
> test/workload to show some improvement with this approach.
>
> Please let me know if there is a specific workload you have in mind.

No objection to the workload you've chosen, I'm more concerned about the
benchmark finishing before live migration completes.

What I'm looking for is something like this:

 - Calculate the ops/sec your benchmark completes in steady state

 - Do a live migration and sample the rate throughout the benchmark,
   accounting for VM blackout time

 - Calculate the area under the curve of:

     y = steady_state_rate - live_migration_rate(t)

 - Compare the area under the curve for write-protection and your DBM
   approach.

> Thanks for getting back on this. I'd appreciate it if you could take a
> quick glance through the rest of the patches as well for any gross
> errors, especially with respect to page table walk locking, usage of the
> DBM flags, etc.

I'll give it a read when I have some spare cycles. To be entirely clear,
I don't have any fundamental objections to using DBM for dirty tracking.
I just want to make sure that all alternatives have been considered
in the current scheme before we seriously consider a new approach with
its own set of tradeoffs.
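Stated as code, the comparison asked for above boils down to integrating
the throughput deficit over the whole migration (blackout included) and
comparing that number across schemes. A minimal sketch, with illustrative
names and a fixed sampling interval:

/*
 * Sketch of the suggested comparison: approximate
 *   lost_work = integral( steady_rate - rate_during_migration(t) ) dt
 * from throughput samples taken every dt_sec seconds, then compare the
 * result for write-protection vs. the DBM approach.
 */
#include <stddef.h>

static double lost_work(double steady_rate, const double *rate, size_t n,
			double dt_sec)
{
	double area = 0.0;
	size_t i;

	/* Trapezoidal rule over the sampled deficit. */
	for (i = 1; i < n; i++) {
		double d0 = steady_rate - rate[i - 1];
		double d1 = steady_rate - rate[i];

		area += 0.5 * (d0 + d1) * dt_sec; /* ops lost this interval */
	}
	return area;	/* total operations lost to the migration */
}

A smaller area means less computation lost to the migration overall,
regardless of how quickly any individual benchmark run happens to finish.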
> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton@linux.dev]
> Sent: 15 September 2023 01:36
> Subject: Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined
> dirty log

[...]

> Sorry, this was rather nonspecific. I was describing the pre-copy
> strategies we're using at Google (out of tree). We're carrying patches
> to use EPT D-bit for exitless dirty tracking.

Just curious, how does it handle the overheads associated with scanning
for dirty pages and the convergence w.r.t. a high rate of dirtying in
exitless mode?

[...]

> What I'm looking for is something like this:
>
>  - Calculate the ops/sec your benchmark completes in steady state
>
>  - Do a live migration and sample the rate throughout the benchmark,
>    accounting for VM blackout time
>
>  - Calculate the area under the curve of:
>
>      y = steady_state_rate - live_migration_rate(t)
>
>  - Compare the area under the curve for write-protection and your DBM
>    approach.

Ok. Got it.

> I'll give it a read when I have some spare cycles. To be entirely clear,
> I don't have any fundamental objections to using DBM for dirty tracking.
> I just want to make sure that all alternatives have been considered
> in the current scheme before we seriously consider a new approach with
> its own set of tradeoffs.

Thanks for taking a look.

Shameer
On Mon, Sep 18, 2023 at 09:55:22AM +0000, Shameerali Kolothum Thodi wrote:

[...]

> > Sorry, this was rather nonspecific. I was describing the pre-copy
> > strategies we're using at Google (out of tree). We're carrying patches
> > to use EPT D-bit for exitless dirty tracking.
>
> Just curious, how does it handle the overheads associated with scanning
> for dirty pages and the convergence w.r.t. a high rate of dirtying in
> exitless mode?

A pool of kthreads, which really isn't a good solution at all. The
'better' way to do it would be to add some back pressure to the guest
such that your pre-copy transfer can converge with the guest, and use
the freed-up CPU time to manage the dirty state. But hopefully we can
make that a userspace issue.
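As a rough illustration of the userspace back pressure being suggested --
in the spirit of QEMU's auto-converge rather than any particular
implementation, with made-up names and constants:

/* Illustrative sketch only; names and constants are invented. */
#include <stdint.h>

#define THROTTLE_STEP_PCT	10	/* cf. autoconverge increment */
#define THROTTLE_MAX_PCT	99

static unsigned int next_throttle_pct(unsigned int cur_pct,
				      uint64_t dirty_bytes_per_s,
				      uint64_t xfer_bytes_per_s)
{
	/* If pre-copy is keeping up, leave the guest alone. */
	if (dirty_bytes_per_s <= xfer_bytes_per_s)
		return cur_pct;

	/* Otherwise slow the vCPUs down another notch so the dirty rate
	 * falls below the available bandwidth and pre-copy converges. */
	cur_pct += THROTTLE_STEP_PCT;
	return cur_pct > THROTTLE_MAX_PCT ? THROTTLE_MAX_PCT : cur_pct;
}

This step size is essentially the knob tuned below via autoconverge's
throttle-increment setting.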
Hi,

> -----Original Message-----
> From: Shameerali Kolothum Thodi
> Sent: 18 September 2023 10:55
> Subject: RE: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined
> dirty log

[...]

> > > Please let me know if there is a specific workload you have in mind.
> >
> > No objection to the workload you've chosen, I'm more concerned about
> > the benchmark finishing before live migration completes.
> >
> > What I'm looking for is something like this:
> >
> >  - Calculate the ops/sec your benchmark completes in steady state
> >
> >  - Do a live migration and sample the rate throughout the benchmark,
> >    accounting for VM blackout time
> >
> >  - Calculate the area under the curve of:
> >
> >      y = steady_state_rate - live_migration_rate(t)
> >
> >  - Compare the area under the curve for write-protection and your DBM
> >    approach.
>
> Ok. Got it.

I attempted to benchmark the performance of this series as suggested
above. I used memcached/memaslap instead of redis-benchmark, as this tool
seems to dirty memory at a faster rate than redis-benchmark in my setup.

./memaslap -s 127.0.0.1:11211 -S 1s -F ./memslap.cnf -T 96 -c 96 -t 20m

Please find the Google Sheets link below for the charts that compare the
average throughput rates during the migration time window for the 6.5-org
and 6.5-kvm-dbm branches.

https://docs.google.com/spreadsheets/d/1T2F94Lsjpx080hW8OSxwbTJXihbXDNlTE1HjWCC0J_4/edit?usp=sharing

Sheet #1: autoconverge=on with default settings (initial throttle 20 and
increment 10). As the charts show, the kvm-dbm branch's throughput during
the original branch's migration window is considerably higher. But the
convergence time to finish the migration increases at almost the same
rate for KVM-DBM, which in effect results in a lower overall average
throughput when compared over the same time window as the original
branch.

Sheet #2: autoconverge=on with throttle-increment set to 15 for the
kvm-dbm run. If we increase the migration throttling rate for the kvm-dbm
branch, it looks to me like we can still have better throughput during
the migration window and also an overall higher throughput rate with the
KVM-DBM solution.

Sheet #3: captures the dirty_log_perf_test times vs. memory per vCPU.
This is also in line with the above results: KVM-DBM has a better,
roughly constant "dirty memory time" compared to the linear increase
noted for the original, but it is just the opposite for "get dirty log
time".

From the above, it looks to me that there is value in using HW DBM for
write-intensive workloads if we adjust the CPU throttling in userspace.

Please take a look and let me know your feedback/thoughts.

Thanks,
Shameer