Message ID: 1598273705-69124-1-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)
Series: per memcg lru_lock
On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote: > The new version which bases on v5.9-rc2. The first 6 patches was picked into > linux-mm, and add patch 25-32 that do some further post optimization. 32 patches, version 18. That's quite heroic. I'm unsure whether I should merge it up at this point - what do people think? > > Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104 > containers on a 2s * 26cores * HT box with a modefied case: > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice > With this patchset, the readtwice performance increased about 80% > in concurrent containers. That's rather a slight amount of performance testing for a huge performance patchset! Is more detailed testing planned?
On Mon, 24 Aug 2020, Andrew Morton wrote: > On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote: > > > The new version which bases on v5.9-rc2. Well timed and well based, thank you Alex. Particulary helpful to me, to include those that already went into mmotm: it's a surer foundation to test on top of the -rc2 base. > > the first 6 patches was picked into > > linux-mm, and add patch 25-32 that do some further post optimization. > > 32 patches, version 18. That's quite heroic. I'm unsure whether I > should merge it up at this point - what do people think? I'd love for it to go into mmotm - but not today. Version 17 tested out well. I've only just started testing version 18, but I'm afraid there's been a number of "improvements" in between, which show up as warnings (lots of VM_WARN_ON_ONCE_PAGE(!memcg) - I think one or more of those are already in mmotm and under discussion on the list, but I haven't read through yet, and I may have caught more cases to examine; a per-cpu warning from munlock_vma_page(); something else flitted by at reboot time before I could read it). No crashes so far, but I haven't got very far with it yet. I'll report back later in the week. Andrew demurred on version 17 for lack of review. Alexander Duyck has been doing a lot on that front since then. I have intended to do so, but it's a mirage that moves away from me as I move towards it: I have some time in the coming weeks to get back to that, but it would help me if the series is held more static by being in mmotm - we may need fixes, but improvements are liable to get in the way of finalizing. I still find the reliance on TestClearPageLRU, rather than lru_lock, hard to wrap my head around: but for so long as it's working correctly, please take that as a problem with my head (and something we can certainly change later if necessary, by re-adding the use of lru_lock in certain places (or by fitting me with a new head)). > > > > > Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104 > > containers on a 2s * 26cores * HT box with a modefied case: > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice > > With this patchset, the readtwice performance increased about 80% > > in concurrent containers. > > That's rather a slight amount of performance testing for a huge > performance patchset! Indeed. And I see that clause about readtwice performance increased 80% going back eight months to v6: a lot of fundamental bugs have been fixed in it since then, so I do think it needs refreshing. It could be faster now: v16 or v17 fixed the last bug I knew of, which had been slowing down reclaim considerably. When I last timed my repetitive swapping loads (not loads anyone sensible would be running with), across only two memcgs, Alex's patchset was slightly faster than without: it really did make a difference. But I tend to think that for all patchsets, there exists at least one test that shows it faster, and another that shows it slower. > Is more detailed testing planned? Not by me, performance testing is not something I trust myself with, just get lost in the numbers: Alex, this is what we hoped for months ago, please make a more convincing case, I hope Daniel and others can make more suggestions. But my own evidence suggests it's good. Hugh
On Mon, Aug 24, 2020 at 01:24:20PM -0700, Hugh Dickins wrote: > On Mon, 24 Aug 2020, Andrew Morton wrote: > > On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote: > Andrew demurred on version 17 for lack of review. Alexander Duyck has > been doing a lot on that front since then. I have intended to do so, > but it's a mirage that moves away from me as I move towards it: I have Same, I haven't been able to keep up with the versions or the recent review feedback. I got through about half of v17 last week and hope to have more time for the rest this week and beyond. > > > Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104 > > > containers on a 2s * 26cores * HT box with a modefied case: Alex, do you have a pointer to the modified readtwice case? Even better would be a description of the problem you're having in production with lru_lock. We might be able to create at least a simulation of it to show what the expected improvement of your real workload is. > > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice > > > With this patchset, the readtwice performance increased about 80% > > > in concurrent containers. > > > > That's rather a slight amount of performance testing for a huge > > performance patchset! > > Indeed. And I see that clause about readtwice performance increased 80% > going back eight months to v6: a lot of fundamental bugs have been fixed > in it since then, so I do think it needs refreshing. It could be faster > now: v16 or v17 fixed the last bug I knew of, which had been slowing > down reclaim considerably. > > When I last timed my repetitive swapping loads (not loads anyone sensible > would be running with), across only two memcgs, Alex's patchset was > slightly faster than without: it really did make a difference. But > I tend to think that for all patchsets, there exists at least one > test that shows it faster, and another that shows it slower. > > > Is more detailed testing planned? > > Not by me, performance testing is not something I trust myself with, > just get lost in the numbers: Alex, this is what we hoped for months > ago, please make a more convincing case, I hope Daniel and others > can make more suggestions. But my own evidence suggests it's good. I ran a few benchmarks on v17 last week (sysbench oltp readonly, kerndevel from mmtests, a memcg-ized version of the readtwice case I cooked up) and then today discovered there's a chance I wasn't running the right kernels, so I'm redoing them on v18. Plan to look into what other, more "macro" tests would be sensitive to these changes.
On 2020/8/25 9:56 AM, Daniel Jordan wrote:
> On Mon, Aug 24, 2020 at 01:24:20PM -0700, Hugh Dickins wrote:
>> On Mon, 24 Aug 2020, Andrew Morton wrote:
>>> On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>> Andrew demurred on version 17 for lack of review. Alexander Duyck has
>> been doing a lot on that front since then. I have intended to do so,
>> but it's a mirage that moves away from me as I move towards it: I have
>
> Same, I haven't been able to keep up with the versions or the recent review
> feedback. I got through about half of v17 last week and hope to have more time
> for the rest this week and beyond.
>
>>>> Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
>>>> containers on a 2s * 26cores * HT box with a modefied case:
>
> Alex, do you have a pointer to the modified readtwice case?

Sorry, no. My development machine crashed, so I lost the container and the
modified case; I am still struggling to get the container back from a
repository with account problems.

But some testing scripts are below. In general, the original readtwice case
runs one thread per CPU; the new case instead runs one container per CPU,
with just one readtwice thread inside each container.

Here are the readtwice case changes (just for reference):

diff --git a/case-lru-file-readtwice b/case-lru-file-readtwice
index 85533b248634..48c6b5f44256 100755
--- a/case-lru-file-readtwice
+++ b/case-lru-file-readtwice
@@ -15,12 +15,9 @@
 . ./hw_vars

-for i in `seq 1 $nr_task`
-do
	create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES / nr_task))
	timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1 &
	timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1 &
-done

 wait
 sleep 1
@@ -31,7 +28,7 @@ do
		echo "dd output file empty: $file" >&2
	}
	cat $file
-	rm $file
+	#rm $file
 done

 rm `seq -f $SPARSE_FILE-%g 1 $nr_task`

And here is how to run the case:
--------
#run all cases on a 24 cpu machine, lrulockv2 is the container with the modified case.
for ((i=0; i<24; i++))
do
	#btw, vm-scalability needs to create 23 loop devices
	docker run --privileged=true --rm lrulockv2 bash -c " sleep 20000" &
done
sleep 15 #wait for all containers to be ready.

#kick off testing
for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do docker exec --privileged=true -it $i bash -c "cd vm-scalability/; bash -x ./run case-lru-file-readtwice "& done

#show testing result for all
for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do echo === $i ===; docker exec $i bash -c 'cat /tmp/vm-scalability-tmp/dd-output-* ' & done

for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do echo === $i ===; docker exec $i bash -c 'cat /tmp/vm-scalability-tmp/dd-output-* ' & done | grep MB | awk 'BEGIN {a=0 ;} { a+=$8} END {print NR, a/(NR)}'

>
> Even better would be a description of the problem you're having in production
> with lru_lock. We might be able to create at least a simulation of it to show
> what the expected improvement of your real workload is.

We are using thousands of memcgs on one machine, but as a simulation, I guess
the above case can be helpful to show the problem.

Thanks a lot!
Alex

>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>>>> With this patchset, the readtwice performance increased about 80%
>>>> in concurrent containers.
>>>
>>> That's rather a slight amount of performance testing for a huge
>>> performance patchset!
>>
>> Indeed.
>> And I see that clause about readtwice performance increased 80%
>> going back eight months to v6: a lot of fundamental bugs have been fixed
>> in it since then, so I do think it needs refreshing. It could be faster
>> now: v16 or v17 fixed the last bug I knew of, which had been slowing
>> down reclaim considerably.
>>
>> When I last timed my repetitive swapping loads (not loads anyone sensible
>> would be running with), across only two memcgs, Alex's patchset was
>> slightly faster than without: it really did make a difference. But
>> I tend to think that for all patchsets, there exists at least one
>> test that shows it faster, and another that shows it slower.

In my testing, case-lru-file-mmap-read is a bit slower, 10+% on a 96-thread
machine, when memcg is enabled but unused. That may be due to the longer
pointer chase to reach the lruvec lock compared with pgdat->lru_lock, since
booting with cgroup_disable=memory fully removes the regression on the new
lock path.

I tried reusing page->private to store the lruvec pointer, which could remove
some of this regression, since private is generally unused on an lru page; but
that patch is too buggy for now.

BTW, I guess memcg would cause more memory disturbance on a large machine if
it's enabled but unused, wouldn't it?

>>
>>> Is more detailed testing planned?
>>
>> Not by me, performance testing is not something I trust myself with,
>> just get lost in the numbers: Alex, this is what we hoped for months
>> ago, please make a more convincing case, I hope Daniel and others
>> can make more suggestions. But my own evidence suggests it's good.
>
> I ran a few benchmarks on v17 last week (sysbench oltp readonly, kerndevel from
> mmtests, a memcg-ized version of the readtwice case I cooked up) and then today
> discovered there's a chance I wasn't running the right kernels, so I'm redoing
> them on v18. Plan to look into what other, more "macro" tests would be
> sensitive to these changes.
>
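To make the pointer-chasing point above concrete, here is a rough sketch of the two lock paths being compared. This is not code from the patchset: lruvec->lru_lock is the field the series adds, and the mem_cgroup_page_lruvec() lookup shown is the dependent load that cgroup_disable=memory short-circuits.

/* Before the series: the lock is reached directly from the page's node. */
static void lru_lock_per_node(struct page *page)
{
	pg_data_t *pgdat = page_pgdat(page);

	spin_lock_irq(&pgdat->lru_lock);
}

/*
 * After the series (sketch): an extra dependent load through
 * page->mem_cgroup is needed before the per-lruvec lock can be taken,
 * and the real code must also recheck that the memcg did not change.
 */
static void lru_lock_per_memcg(struct page *page)
{
	struct lruvec *lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));

	spin_lock_irq(&lruvec->lru_lock);
}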
On Mon 24-08-20 11:42:04, Andrew Morton wrote: > On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote: > > > The new version which bases on v5.9-rc2. The first 6 patches was picked into > > linux-mm, and add patch 25-32 that do some further post optimization. > > 32 patches, version 18. That's quite heroic. I'm unsure whether I > should merge it up at this point - what do people think? This really needs a proper review. Unfortunately: 24 files changed, 646 insertions(+), 443 deletions(-) is quite an undertaking to review as well. Especially in tricky code which is full of surprises. I do agree that per memcg locking looks like a nice feature but I do not see any pressing reason to merge it ASAP. The cover letter doesn't really describe any pressing usecase that cannot really live without this being merged. I am fully aware of my debt to review but I simply cannot find enough time to sit on it and think it through to have meaningful feedback at this moment. > > Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104 > > containers on a 2s * 26cores * HT box with a modefied case: > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice > > With this patchset, the readtwice performance increased about 80% > > in concurrent containers. > > That's rather a slight amount of performance testing for a huge > performance patchset! Is more detailed testing planned? Agreed! This needs much better testing coverage.
On 2020/8/25 9:56 AM, Daniel Jordan wrote:
> On Mon, Aug 24, 2020 at 01:24:20PM -0700, Hugh Dickins wrote:
>> On Mon, 24 Aug 2020, Andrew Morton wrote:
>>> On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>> Andrew demurred on version 17 for lack of review. Alexander Duyck has
>> been doing a lot on that front since then. I have intended to do so,
>> but it's a mirage that moves away from me as I move towards it: I have
>
> Same, I haven't been able to keep up with the versions or the recent review
> feedback. I got through about half of v17 last week and hope to have more time
> for the rest this week and beyond.
>
>>>> Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
>>>> containers on a 2s * 26cores * HT box with a modefied case:
>
> Alex, do you have a pointer to the modified readtwice case?
>

Hi Daniel,

My readtwice modification is below.

diff --git a/case-lru-file-readtwice b/case-lru-file-readtwice
index 85533b248634..57cb97d121ae 100755
--- a/case-lru-file-readtwice
+++ b/case-lru-file-readtwice
@@ -15,23 +15,30 @@
 . ./hw_vars

-for i in `seq 1 $nr_task`
-do
-	create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES / nr_task))
-	timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1 &
-	timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1 &
-done
+OUT_DIR=$(hostname)-${nr_task}c-$(((mem + (1<<29))>>30))g
+TEST_CASES=${@:-$(echo case-*)}
+
+echo $((1<<30)) > /proc/sys/vm/max_map_count
+echo $((1<<20)) > /proc/sys/kernel/threads-max
+echo 1 > /proc/sys/vm/overcommit_memory
+#echo 3 > /proc/sys/vm/drop_caches
+
+
+i=1
+
+if [ "$1" == "m" ];then
+	mount_tmpfs
+	create_sparse_root
+	create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES))
+	exit
+fi
+
+
+if [ "$1" == "r" ];then
+	(timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1)&
+	(timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1)&
+fi

 wait
 sleep 1

-for file in $TMPFS_MNT/dd-output-*
-do
-	[ -s "$file" ] || {
-		echo "dd output file empty: $file" >&2
-	}
-	cat $file
-	rm $file
-done
-
-rm `seq -f $SPARSE_FILE-%g 1 $nr_task`

diff --git a/hw_vars b/hw_vars
index 8731cefb9f57..ceeaa9f17c0b 100755
--- a/hw_vars
+++ b/hw_vars
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/sh -ex

 if [ -n "$runtime" ]; then
	USEMEM="$CMD ./usemem --runtime $runtime"
@@ -43,7 +43,7 @@ create_loop_devices()
	modprobe loop 2>/dev/null
	[ -e "/dev/loop0" ] || modprobe loop 2>/dev/null

-	for i in $(seq 0 8)
+	for i in $(seq 0 104)
	do
		[ -e "/dev/loop$i" ] && continue
		mknod /dev/loop$i b 7 $i
On Tue, Aug 25, 2020 at 11:26:58AM +0800, Alex Shi wrote: > I tried reusing page->prviate to store lruvec pointer, that could remove some > regression on this, since private is generally unused on a lru page. But the patch > is too buggy now. page->private is for the use of the filesystem. You can grep for attach_page_private() to see how most filesystems use it. Some still use set_page_private() for various reasons.
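For reference, the helpers Matthew points to look roughly like this in 5.8+ kernels (paraphrased from include/linux/pagemap.h; check the source for the exact definitions). The point is that page->private, together with a page reference and the PG_private flag, belongs to whichever filesystem attached data there, so it is not free for LRU bookkeeping to borrow:

/* Paraphrase of the pagemap.h helpers, not a verbatim copy. */
static inline void attach_page_private(struct page *page, void *data)
{
	get_page(page);
	set_page_private(page, (unsigned long)data);
	SetPagePrivate(page);
}

static inline void *detach_page_private(struct page *page)
{
	void *data = (void *)page_private(page);

	if (!PagePrivate(page))
		return NULL;
	ClearPagePrivate(page);
	set_page_private(page, 0);
	put_page(page);

	return data;
}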
On 2020/8/25 4:52 PM, Alex Shi wrote:
>
> On 2020/8/25 9:56 AM, Daniel Jordan wrote:
>> On Mon, Aug 24, 2020 at 01:24:20PM -0700, Hugh Dickins wrote:
>>> On Mon, 24 Aug 2020, Andrew Morton wrote:
>>>> On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>> Andrew demurred on version 17 for lack of review. Alexander Duyck has
>>> been doing a lot on that front since then. I have intended to do so,
>>> but it's a mirage that moves away from me as I move towards it: I have
>> Same, I haven't been able to keep up with the versions or the recent review
>> feedback. I got through about half of v17 last week and hope to have more time
>> for the rest this week and beyond.
>>
>>>>> Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
>>>>> containers on a 2s * 26cores * HT box with a modefied case:
>> Alex, do you have a pointer to the modified readtwice case?
>>
> Hi Daniel,
>
> my readtwice modification like below.
>
> diff --git a/case-lru-file-readtwice b/case-lru-file-readtwice

Hi Daniel,

I finally got my container settled and found that I had given you a version of
my scripts that does not work together with it. I am sorry! I will try to bring
them together and post a new version.

Thanks a lot!
Alex
On Tue, Aug 25, 2020 at 11:26:58AM +0800, Alex Shi wrote: > On 2020/8/25 9:56 AM, Daniel Jordan wrote: > > Alex, do you have a pointer to the modified readtwice case? > > Sorry, no. my developer machine crashed, so I lost case my container and modified > case. I am struggling to get my container back from a account problematic repository. > > But some testing scripts is here, generally, the original readtwice case will > run each of threads on each of cpus. The new case will run one container on each cpus, > and just run one readtwice thead in each of containers. Ok, what you've sent so far gives me an idea of what you did. My readtwice changes were similar, except I used the cgroup interface directly instead of docker and shared a filesystem between all the cgroups whereas it looks like you had one per memcg. 30 second runs on 5.9-rc2 and v18 gave 11% more data read with v18. This was using 16 cgroups (32 dd tasks) on a 40 CPU, 2 socket machine. > > Even better would be a description of the problem you're having in production > with lru_lock. We might be able to create at least a simulation of it to show > what the expected improvement of your real workload is. > > we are using thousands memcgs in a machine, but as a simulation, I guess above case > could be helpful to show the problem. Using thousands of memcgs to do what? Any particulars about the type of workload? Surely it's more complicated than page cache reads :) > > I ran a few benchmarks on v17 last week (sysbench oltp readonly, kerndevel from > mmtests, a memcg-ized version of the readtwice case I cooked up) and then today > discovered there's a chance I wasn't running the right kernels, so I'm redoing > them on v18. Neither kernel compile nor git checkout in the root cgroup changed much, just 0.31% slower on elapsed time for the compile, so no significant regressions there. Now for sysbench again.
On 2020/8/26 9:19 AM, Daniel Jordan wrote:
> On Tue, Aug 25, 2020 at 11:26:58AM +0800, Alex Shi wrote:
>> On 2020/8/25 9:56 AM, Daniel Jordan wrote:
>>> Alex, do you have a pointer to the modified readtwice case?
>>
>> Sorry, no. my developer machine crashed, so I lost case my container and modified
>> case. I am struggling to get my container back from a account problematic repository.
>>
>> But some testing scripts is here, generally, the original readtwice case will
>> run each of threads on each of cpus. The new case will run one container on each cpus,
>> and just run one readtwice thead in each of containers.
>
> Ok, what you've sent so far gives me an idea of what you did. My readtwice
> changes were similar, except I used the cgroup interface directly instead of
> docker and shared a filesystem between all the cgroups whereas it looks like
> you had one per memcg. 30 second runs on 5.9-rc2 and v18 gave 11% more data
> read with v18. This was using 16 cgroups (32 dd tasks) on a 40 CPU, 2 socket
> machine.

I cleaned up my testing and made it reproducible with a Dockerfile and a case
patch, both attached. You can build a container from the file and then run the
test like this:

#start some testing containers
for ((i=0; i< 80; i++)); do docker run --privileged=true --rm lrulock bash -c " sleep 20000" & done

#do testing env setup
for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do docker exec --privileged=true -it $i bash -c "cd vm-scalability/; bash -x ./case-lru-file-readtwice m"& done

#kick off testing
for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do docker exec --privileged=true -it $i bash -c "cd vm-scalability/; bash -x ./case-lru-file-readtwice r"& done

#show result
for i in `docker ps | sed '1 d' | awk '{print $1 }'` ;do echo === $i ===; docker exec $i bash -c 'cat /tmp/vm-scalability-tmp/dd-output-* ' & done | grep MB | awk 'BEGIN {a=0;} { a+=$10 } END {print NR, a/(NR)}'

This time, on a 2P * 20 core * 2 HT machine, this readtwice performance is 252%
compared to the v5.9-rc2 kernel. A good surprise!

>
>>> Even better would be a description of the problem you're having in production
>>> with lru_lock. We might be able to create at least a simulation of it to show
>>> what the expected improvement of your real workload is.
>>
>> we are using thousands memcgs in a machine, but as a simulation, I guess above case
>> could be helpful to show the problem.
>
> Using thousands of memcgs to do what? Any particulars about the type of
> workload? Surely it's more complicated than page cache reads :)

Yes, the workloads are quite different across different businesses: some use a
lot of CPU, some use a lot of memory, and some are mixed. The number of
containers also varies widely, from tens to hundreds to thousands.

>
>>> I ran a few benchmarks on v17 last week (sysbench oltp readonly, kerndevel from
>>> mmtests, a memcg-ized version of the readtwice case I cooked up) and then today
>>> discovered there's a chance I wasn't running the right kernels, so I'm redoing
>>> them on v18.
>
> Neither kernel compile nor git checkout in the root cgroup changed much, just
> 0.31% slower on elapsed time for the compile, so no significant regressions
> there. Now for sysbench again.
>

Thanks a lot for the testing report!
Alex

FROM centos:8
MAINTAINER Alexs

#WORKDIR /vm-scalability
#RUN yum update -y && yum groupinstall "Development Tools" -y && yum clean all && \
#examples https://www.linuxtechi.com/build-docker-container-images-with-dockerfile/
RUN yum install git xfsprogs patch make gcc -y && yum clean all && \
	git clone https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/ && \
	cd vm-scalability && make usemem

COPY readtwice.patch /vm-scalability/
RUN cd vm-scalability && patch -p1 < readtwice.patch

diff --git a/case-lru-file-readtwice b/case-lru-file-readtwice
index 85533b248634..57cb97d121ae 100755
--- a/case-lru-file-readtwice
+++ b/case-lru-file-readtwice
@@ -15,23 +15,30 @@
 . ./hw_vars

-for i in `seq 1 $nr_task`
-do
-	create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES / nr_task))
-	timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1 &
-	timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1 &
-done
+OUT_DIR=$(hostname)-${nr_task}c-$(((mem + (1<<29))>>30))g
+TEST_CASES=${@:-$(echo case-*)}
+
+echo $((1<<30)) > /proc/sys/vm/max_map_count
+echo $((1<<20)) > /proc/sys/kernel/threads-max
+echo 1 > /proc/sys/vm/overcommit_memory
+#echo 3 > /proc/sys/vm/drop_caches
+
+
+i=1
+
+if [ "$1" == "m" ];then
+	mount_tmpfs
+	create_sparse_root
+	create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES))
+	exit
+fi
+
+
+if [ "$1" == "r" ];then
+	(timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1)&
+	(timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1)&
+fi

 wait
 sleep 1

-for file in $TMPFS_MNT/dd-output-*
-do
-	[ -s "$file" ] || {
-		echo "dd output file empty: $file" >&2
-	}
-	cat $file
-	rm $file
-done
-
-rm `seq -f $SPARSE_FILE-%g 1 $nr_task`

diff --git a/hw_vars b/hw_vars
index 8731cefb9f57..ceeaa9f17c0b 100755
--- a/hw_vars
+++ b/hw_vars
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/sh -ex

 if [ -n "$runtime" ]; then
	USEMEM="$CMD ./usemem --runtime $runtime"
@@ -43,7 +43,7 @@ create_loop_devices()
	modprobe loop 2>/dev/null
	[ -e "/dev/loop0" ] || modprobe loop 2>/dev/null

-	for i in $(seq 0 8)
+	for i in $(seq 0 104)
	do
		[ -e "/dev/loop$i" ] && continue
		mknod /dev/loop$i b 7 $i
@@ -101,11 +101,11 @@ remove_sparse_root () {
 create_sparse_file () {
	name=$1
	size=$2
-	# echo "$name is of size $size"
+	echo "$name is of size $size"
	$CMD truncate $name -s $size
	# dd if=/dev/zero of=$name bs=1k count=1 seek=$((size >> 10)) 2>/dev/null
-	# ls $SPARSE_ROOT
-	# ls /tmp/vm-scalability/*
+	ls $SPARSE_ROOT
+	ls /tmp/vm-scalability/*
 }
On Mon, 24 Aug 2020, Hugh Dickins wrote: > On Mon, 24 Aug 2020, Andrew Morton wrote: > > On Mon, 24 Aug 2020 20:54:33 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote: > > > > > The new version which bases on v5.9-rc2. > > Well timed and well based, thank you Alex. Particulary helpful to me, > to include those that already went into mmotm: it's a surer foundation > to test on top of the -rc2 base. > > > > the first 6 patches was picked into > > > linux-mm, and add patch 25-32 that do some further post optimization. > > > > 32 patches, version 18. That's quite heroic. I'm unsure whether I > > should merge it up at this point - what do people think? > > I'd love for it to go into mmotm - but not today. > > Version 17 tested out well. I've only just started testing version 18, > but I'm afraid there's been a number of "improvements" in between, > which show up as warnings (lots of VM_WARN_ON_ONCE_PAGE(!memcg) - > I think one or more of those are already in mmotm and under discussion > on the list, but I haven't read through yet, and I may have caught > more cases to examine; a per-cpu warning from munlock_vma_page(); Alex already posted the fix for that one. > something else flitted by at reboot time before I could read it). That one still eludes me, but I'm not giving it high priority. > No crashes so far, but I haven't got very far with it yet. > > I'll report back later in the week. Just a quick report for now: I have some fixes, not to Alex's patchset itself, but to things it revealed - a couple of which I knew of already, but better now be fixed. Once I've fleshed those out with comments and sent them in, I'll get down to review. Testing held up very well, no other problems seen in the patchset, and the 1/27 discovered something useful. I was going to say, no crashes observed at all, but one did crash this afternoon. But like before, I think it's something unrelated to Alex's work, just revealed now that I hammer harder on compaction (knowing that to be the hardest test for per-memcg lru_lock). It was a crash from checking PageWaiters on a Tail in wake_up_page(), called from end_page_writeback(), from ext4_finish_bio(): yet the page a tail of a shmem huge page. Linus's wake_up_page_bit() changes? No, I don't think so. It seems to me that once end_page_writeback() has done its test_clear_page_writeback(), it has no further hold on the struct page, which could be reused as part of a compound page by the time of wake_up_page()'s PageWaiters check. But I probably need to muse on that for longer. (I'm also kind-of-worried because Alex's patchset should make no functional difference, yet appears to fix some undebugged ZONE_DMA=y slow leak of memory that's been plaguing my testing for months. I mention that in case those vague words are enough to prompt an idea from someone, but cannot afford to spend much time on it.) Hugh
On Thu, Aug 27, 2020 at 12:01:00AM -0700, Hugh Dickins wrote:
> It was a crash from checking PageWaiters on a Tail in wake_up_page(),
> called from end_page_writeback(), from ext4_finish_bio(): yet the
> page a tail of a shmem huge page. Linus's wake_up_page_bit() changes?
> No, I don't think so. It seems to me that once end_page_writeback()
> has done its test_clear_page_writeback(), it has no further hold on
> the struct page, which could be reused as part of a compound page
> by the time of wake_up_page()'s PageWaiters check. But I probably
> need to muse on that for longer.

I think you're right. Example:

truncate_inode_pages_range()
  pagevec_lookup_entries()
  lock_page()
                                --- ctx switch ---
ext4_finish_bio()
  end_page_writeback()
    test_clear_page_writeback()
                                --- ctx switch ---
  wait_on_page_writeback() <- noop
  truncate_inode_page()
  unlock_page()
  pagevec_release()
  ... page can now be allocated
                                --- ctx switch ---
    wake_up_page()

PageWaiters then has that check for PageTail. This isn't unique to ext4;
the iomap completion path behaves the exact same way.

The thing is, this is a harmless race. It seems unnecessary for anybody
here to incur the overhead of adding a page ref to be sure the page isn't
reallocated. We don't want to wake up the waiters before clearing the bit
in question. I'm tempted to suggest this:

 static void wake_up_page(struct page *page, int bit)
 {
-	if (!PageWaiters(page))
+	if (PageTail(page) || !PageWaiters(page))
 		return;
 	wake_up_page_bit(page, bit);

which only adds an extra read to the struct page that we were going to
access anyway. Even that seems unnecessary though; PageWaiters is going
to be clear. Maybe we can just change the PF policy from PF_ONLY_HEAD to
PF_ANY. I don't think it's critical that we have this check. Nick?
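For context on the PF_ONLY_HEAD vs PF_ANY suggestion: the third argument of the PAGEFLAG() macros selects a policy that decides which pages a flag test may be applied to. The following is a paraphrase of include/linux/page-flags.h from kernels of that era (check the source for the exact definitions):

/* Paraphrased policy macros, not a verbatim copy. */
#define PF_ANY(page, enforce)	PF_POISONED_CHECK(page)
#define PF_ONLY_HEAD(page, enforce) ({					\
		VM_BUG_ON_PGFLAGS(PageTail(page), page);		\
		PF_POISONED_CHECK(page); })

/*
 * PageWaiters() is declared with the strict policy, so calling it on a
 * tail page trips the VM_BUG_ON above on CONFIG_DEBUG_VM_PGFLAGS kernels;
 * relaxing it to PF_ANY would make the racy call in wake_up_page()
 * harmless, at the cost of losing that debug check elsewhere.
 */
PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD)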
On Wed, Aug 26, 2020 at 04:59:28PM +0800, Alex Shi wrote: > I clean up my testing and make it reproducable by a Dockerfile and a case patch which > attached. Ok, I'll give that a shot once I've taken care of sysbench. > >>> Even better would be a description of the problem you're having in production > >>> with lru_lock. We might be able to create at least a simulation of it to show > >>> what the expected improvement of your real workload is. > >> > >> we are using thousands memcgs in a machine, but as a simulation, I guess above case > >> could be helpful to show the problem. > > > > Using thousands of memcgs to do what? Any particulars about the type of > > workload? Surely it's more complicated than page cache reads :) > > Yes, the workload are quit different on different business, some use cpu a > lot, some use memory a lot, and some are may mixed. That's pretty vague, but I don't suppose I could do much better describing what all runs on our systems :-/ I went back to your v1 post to see what motivated you originally, and you had some results from aim9 but nothing about where this reared its head in the first place. How did you discover the bottleneck? I'm just curious about how lru_lock hurts in practice. > > Neither kernel compile nor git checkout in the root cgroup changed much, just > > 0.31% slower on elapsed time for the compile, so no significant regressions > > there. Now for sysbench again. Still working on getting repeatable sysbench runs, no luck so far. The numbers have stayed fairly consistent with your series but vary a lot on the base kernel, not sure why yet.
On 2020/8/28 9:40 AM, Daniel Jordan wrote:
> I went back to your v1 post to see what motivated you originally, and you had
> some results from aim9 but nothing about where this reared its head in the
> first place. How did you discover the bottleneck? I'm just curious about how
> lru_lock hurts in practice.

We have seen very high 'sys' time on some business machines, and found that
much of it was spent on the lru_lock and/or the zone lock. Per-memcg lru_lock
seems to help with the former, but we still have no idea what to do about the
zone lock.

Thanks
Alex
Miscellaneous Acks and NAKs and other comments on the beginning and the end of the series, but not much yet on the all-important middle. I'm hoping to be spared sending ~20 email replies to ~20 patches. [PATCH v18 01/32] mm/memcg: warning on !memcg after readahead page charged Acked-by: Hugh Dickins <hughd@google.com> if you make these changes: Please add "Add VM_WARN_ON_ONCE_PAGE() macro." or something like that to the commit message: that's a good addition that we shall find useful in other places, so please advertise it. Delete the four comment lines /* Readahead page is charged too, to see if other page uncharged */ which make no sense on their own. [PATCH v18 02/32] mm/memcg: bail out early from swap accounting when memcg is disabled Acked-by: Hugh Dickins <hughd@google.com> [PATCH v18 03/32] mm/thp: move lru_add_page_tail func to huge_memory.c Acked-by: Hugh Dickins <hughd@google.com> [PATCH v18 04/32] mm/thp: clean up lru_add_page_tail Acked-by: Hugh Dickins <hughd@google.com> Though I'd prefer "mm/thp: use head for head page in lru_add_page_tail" to the unnecessarily vague "clean up". But you're right to keep this renaming separate from the code movement in the previous commit, and perhaps right to keep it from the more interesting cleanup next. [PATCH v18 05/32] mm/thp: remove code path which never got into This is a good simplification, but I see no sign that you understand why it's valid: it relies on lru_add_page_tail() being called while head refcount is frozen to 0: we would not get this far if someone else holds a reference to the THP - which they must hold if they have isolated the page from its lru (and that's true before or after your per-memcg changes - but even truer after those changes, since PageLRU can then be flipped without lru_lock at any instant): please explain something of this in the commit message. You revisit this same code in 18/32, and I much prefer the way it looks after that (if (list) {} else {}) - this 05/32 is a bit weird, it would be easier to understand if it just did VM_WARN_ON(1). Please pull the 18/32 mods back into this one, maybe adding a VM_WARN_ON(PageLRU) into the "if (list)" block too. [PATCH v18 18/32] mm/thp: add tail pages into lru anyway in split_huge_page() Please merge into 05/32. But what do "Split_huge_page() must start with PageLRU(head)" and "Split start from PageLRU(head)" mean? Perhaps you mean that if list is NULL, then if the head was not on the LRU, then it cannot have got through page_ref_freeze(), because isolator would hold page ref? That is subtle, and deserves mention in the commit comment, but is not what you have said at all. s/unexpected/unexpectedly/. [PATCH v18 06/32] mm/thp: narrow lru locking Why? What part does this play in the series? "narrow lru locking" can also be described as "widen page cache locking": you are changing the lock ordering, and not giving any reason to do so. This may be an excellent change, or it may be a terrible change: I find that usually lock ordering is forced upon us, and it's rare to meet an instance like this that could go either way, and I don't know myself how to judge it. 
I do want this commit to go in, partly because it has been present in all the testing we have done, and partly because I *can at last* see a logical advantage to it - it also nests lru_lock inside memcg->move_lock, allowing lock_page_memcg() to be used to stabilize page->mem_cgroup when getting per-memcg lru_lock - though only in one place, starting in v17, do you actually use that (and, warning: it's not used correctly there). I'm not very bothered by how the local_irq_disable() looks to RT: THP seems a very bad idea in an RT kernel. Earlier I asked you to run this past Kirill and Matthew and Johannes: you did so, thank you, and Kirill has blessed it, and no one has nacked it, and I have not noticed any disadvantage from this change in lock ordering (documented in 23/32), so I'm now going to say Acked-by: Hugh Dickins <hughd@google.com> But I wish you could give some reason for it in the commit message! Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Is that correct? Or Wei Yang suggested some part of it perhaps? [PATCH v18 07/32] mm/swap.c: stop deactivate_file_page if page not on lru Perhaps; or perhaps by the time the pagevec is full, the page has been drained to the lru, and it should be deactivated? I'm indifferent. Is this important for per-memcg lru_lock? [PATCH v18 08/32] mm/vmscan: remove unnecessary lruvec adding You are optimizing for a case which you then mark unlikely(), and I don't agree that it makes the flow clearer; but you've added a useful comment on the race there, so please s/intergrity/integrity/ in commit message and in code comment, then Acked-by: Hugh Dickins <hughd@google.com> [PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting I strongly approve of removing the abuse of lru_lock here, but the patch is wrong: you are mistaken in thinking the PageLRU check after get_page_unless_zero() is an unnecessary duplicaton of the one before. No, the one before is an optimization, and the one after is essential, for telling whether this page (arrived at via pfn, like in compaction) is the kind of page we understand (address_space or anon_vma or KSM stable_node pointer in page->mapping), so can use rmap_walk() on. Please replace this patch by mine from the tarball I posted a year ago, which keeps both checks, and justifies it against why the lru_lock was put there in the first place - thanks to Vladimir for pointing me to that mail thread when I tried to submit this patch a few years ago. Appended at the end of this mail. [PATCH v18 10/32] mm/compaction: rename compact_deferred as compact_should_defer I'm indifferent: I see your point about the name, but it hasn't caused confusion in ten years, whereas changing name and tracepoint might cause confusion. And how does changing the name help per-memcg lru_lock? It just seems to be a random patch from your private tree. If it's Acked by Mel who coined the name, or someone who has done a lot of work there (Vlastimil? Joonsoo?), fine, I have no problem with it; but I don't see what it's doing in this series - better left out. [PATCH v18 11/32] mm/memcg: add debug checking in lock_page_memcg This is a very useful change for helping lockdep: Acked-by: Hugh Dickins <hughd@google.com> [PATCH v18 12/32] mm/memcg: optimize mem_cgroup_page_lruvec Hah, I see this is in my name. Well, I did once suggest folding this into one of your patches, but it's not an optimization, and that was before you added VM_WARN_ON_ONCE_PAGE() here. 
It looks strange now, a VM_BUG_ON_PAGE() next to a VM_WARN_ON_ONCE_PAGE(); and the latter will catch that PageTail case anyway (once). And although I feel slightly safer with READ_ONCE(page->mem_cgroup), I'm finding it hard to justify, doing so here but not in other places: particularly since just above it says "This function relies on page->mem_cgroup being stable". Let's just drop this patch. [PATCH v18 13/32] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn Yes, nice cleanup, I don't see why it should be different and force an unused arg on the others. But I have one reservation: you added comment + * + * pagevec_move_tail_fn() must be called with IRQ disabled. + * Otherwise this may cause nasty races. above rotate_reclaimable_page(), having deleted pagevec_move_tail() which had such a comment. It doesn't make sense, because pagevec_move_tail_fn() is called with IRQ disabled anyway. That comment had better say + * + * rotate_reclaimable_page() must disable IRQs, to prevent nasty races. I dimly remember hitting those nasty races many years ago, but forget the details. Oh, one other thing, you like to use "func" as abbreviation for "function", okay: but then at the end of the commit message you say "no func change" - please change that to "No functional change". Acked-by: Hugh Dickins <hughd@google.com> [PATCH v18 14/32] mm/lru: move lru_lock holding in func lru_note_cost_page "w/o functional changes" instead of "w/o function changes". But please just merge this into the next, 15/32: there is no point in separating them. [PATCH v18 15/32] mm/lru: move lock into lru_note_cost [PATCH v18 16/32] mm/lru: introduce TestClearPageLRU [PATCH v18 17/32] mm/compaction: do page isolation first in compaction [PATCH v18 19/32] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn [PATCH v18 20/32] mm/lru: replace pgdat lru_lock with lruvec lock [PATCH v18 21/32] mm/lru: introduce the relock_page_lruvec function [PATCH v18 22/32] mm/vmscan: use relock for move_pages_to_lru [PATCH v18 23/32] mm/lru: revise the comments of lru_lock [PATCH v18 24/32] mm/pgdat: remove pgdat lru_lock [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page [PATCH v18 26/32] mm/mlock: remove __munlock_isolate_lru_page I have tested, but not yet studied these, and it's a good point to break off and send my comments so far, because 15/32 is where the cleanups end and per-memcg lru_lock kind-of begins - lru_note_cost() being potentially more costly, because it needs to use a different lock at each level. (When I tried rebasing my own series a couple of months ago, I stopped here at lru_note_cost() too, wondering if there was a better way.) Two things I do know about from testing, that need to be corrected: check_move_unevictable_pages() needs protection from page->memcg being changed while doing the relock_page_lruvec_irq(): could use TestClearPageLRU there (!PageLRU pages are safely skipped), but that doubles the number of atomic ops involved. I intended to use lock_page_memcg() instead, but that's harder than you'd expect: so probably TestClearPageLRU will be the best to use there for now. The use of lock_page_memcg() in __munlock_pagevec() in 20/32, introduced in patchset v17, looks good but it isn't: I was lucky that systemd at reboot did some munlocking that exposed the problem to lockdep. The first time into the loop, lock_page_memcg() is done before lru_lock (as 06/32 has allowed); but the second time around the loop, it is done while still holding lru_lock. 
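A minimal sketch of the inversion Hugh describes, with hypothetical helper names (this is not the real __munlock_pagevec() code; lock_page_lruvec_irq()/unlock_page_lruvec_irq() stand in for the series' per-lruvec helpers, and page_matches_lruvec() is an illustrative check):

/*
 * Iteration 1: lock_page_memcg() is taken first, then lru_lock - the
 * order that 06/32 permits.  Iteration 2: the lruvec is unchanged, so
 * lru_lock is still held from before, and lock_page_memcg() is now
 * acquired inside it - the reverse order, which is what lockdep reports.
 */
struct lruvec *lruvec = NULL;
int i;

for (i = 0; i < pagevec_count(pvec); i++) {
	struct page *page = pvec->pages[i];

	lock_page_memcg(page);
	if (!lruvec || !page_matches_lruvec(page, lruvec)) {	/* hypothetical */
		if (lruvec)
			unlock_page_lruvec_irq(lruvec);
		lruvec = lock_page_lruvec_irq(page);
	}
	/* ... munlock work on the page ... */
	unlock_page_memcg(page);
}
if (lruvec)
	unlock_page_lruvec_irq(lruvec);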
lock_page_memcg() really needs to be absorbed into (a variant of) relock_page_lruvec(), and I do have that (it's awkward because of the different ways in which the IRQ flags are handled). And out of curiosity, I've also tried using that in mm/swap.c too, instead of the TestClearPageLRU technique: lockdep is happy, but an update_lru_size() warning showed that it cannot safely be mixed with the TestClearPageLRU technique (that I'd left in isolate_lru_page()). So I'll stash away that relock_page_lruvec(), and consider what's best for mm/mlock.c: now that I've posted these comments so far, that's my priority, then to get the result under testing again, before resuming these comments. Jumping over 15-26, and resuming comments on recent additions: [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock Could we please drop this one for the moment? And come back to it later when the basic series is safely in. It's a good idea to try sorting together those pages which come under the same lock (though my guess is that they naturally gather themselves together quite well already); but I'm not happy adding 360 bytes to the kernel stack here (and that in addition to 192 bytes of horrid pseudo-vma in the shmem swapin case), though that could be avoided by making it per-cpu. But I hope there's a simpler way of doing it, as efficient, but also useful for the other pagevec operations here: perhaps scanning the pagevec for same page-> mem_cgroup (and flags node bits), NULLing entries as they are done. Another, easily fixed, minor defect in this patch: if I'm reading it right, it reverses the order in which the pages are put on the lru? [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block Most of this consists of replacing "locked" by "lruvec", which is good: but please fold those changes back into 20/32 (or would it be 17/32? I've not yet looked into the relationship between those two), so we can then see more clearly what change this 28/32 (will need renaming!) actually makes, to use lruvec_holds_page_lru_lock(). That may be a good change, but it's mixed up with the "locked"->"lruvec" at present, and I think you could have just used lruvec for locked all along (but of course there's a place where you'll need new_lruvec too). [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block NAK. I agree that isolate_migratepages_block() looks nicer this way, but take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is where set_page_refcounted() changes page->_refcount from 0 to 1, allowing a racing get_page_unless_zero() to succeed; then later prep_compound_page() is where PageHead and PageTails get set. So there's a small race window in which this patch could deliver a compound page when it should not. [PATCH v18 30/32] mm: Drop use of test_and_set_skip in favor of just setting skip I haven't looked at this yet (but recall that per-memcg lru_lock can change the point at which compaction should skip a contended lock: IIRC the current kernel needs nothing extra, whereas some earlier kernels did need extra; but when I look at 30/32, may find these remarks irrelevant). [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages The title of this patch is definitely wrong: there was an explicit page decrement there before (put_page), now it's wrapping it up inside a WARN_ON(). We usually prefer to avoid doing functional operations inside WARN/BUGs, but I think I'll overlook that - anyone else worried? 
The comment is certainly better than what was there before: yes, this
warning reflects the difficulty we have in thinking about the
TestClearPageLRU protocol: which I'm still not sold on, but agree we
should proceed with. With a change in title, perhaps
"mm: add warning where TestClearPageLRU failed on freeable page"?
Acked-by: Hugh Dickins <hughd@google.com>

[PATCH v18 32/32] mm: Split release_pages work into 3 passes
I haven't looked at this yet (but seen no problem with it in testing).

And finally, here's my replacement (rediffed against 5.9-rc) for
[PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting

From: Hugh Dickins <hughd@google.com>
Date: Mon, 13 Jun 2016 19:43:34 -0700
Subject: [PATCH] mm: page_idle_get_page() does not need lru_lock

It is necessary for page_idle_get_page() to recheck PageLRU() after
get_page_unless_zero(), but holding lru_lock around that serves no
useful purpose, and adds to lru_lock contention: delete it.

See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the
discussion that led to lru_lock there; but __page_set_anon_rmap() now
uses WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs()
using rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly
but not entirely prevented by page_count() check in ksm.c's
write_protect_page(): that risk being shared with page_referenced()
and not helped by lru_lock).

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
---
 mm/page_idle.c | 4 ----
 1 file changed, 4 deletions(-)

--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -32,19 +32,15 @@
 static struct page *page_idle_get_page(unsigned long pfn)
 {
 	struct page *page = pfn_to_online_page(pfn);
-	pg_data_t *pgdat;

 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;

-	pgdat = page_pgdat(page);
-	spin_lock_irq(&pgdat->lru_lock);
 	if (unlikely(!PageLRU(page))) {
 		put_page(page);
 		page = NULL;
 	}
-	spin_unlock_irq(&pgdat->lru_lock);
 	return page;
 }
On Tue, Sep 08, 2020 at 04:41:00PM -0700, Hugh Dickins wrote: [...] >[PATCH v18 06/32] mm/thp: narrow lru locking >Why? What part does this play in the series? "narrow lru locking" can >also be described as "widen page cache locking": you are changing the >lock ordering, and not giving any reason to do so. This may be an >excellent change, or it may be a terrible change: I find that usually >lock ordering is forced upon us, and it's rare to meet an instance like >this that could go either way, and I don't know myself how to judge it. > >I do want this commit to go in, partly because it has been present in >all the testing we have done, and partly because I *can at last* see a >logical advantage to it - it also nests lru_lock inside memcg->move_lock, >allowing lock_page_memcg() to be used to stabilize page->mem_cgroup when >getting per-memcg lru_lock - though only in one place, starting in v17, >do you actually use that (and, warning: it's not used correctly there). > >I'm not very bothered by how the local_irq_disable() looks to RT: THP >seems a very bad idea in an RT kernel. Earlier I asked you to run this >past Kirill and Matthew and Johannes: you did so, thank you, and Kirill >has blessed it, and no one has nacked it, and I have not noticed any >disadvantage from this change in lock ordering (documented in 23/32), >so I'm now going to say > >Acked-by: Hugh Dickins <hughd@google.com> > >But I wish you could give some reason for it in the commit message! > >Signed-off-by: Wei Yang <richard.weiyang@gmail.com> >Is that correct? Or Wei Yang suggested some part of it perhaps? > If my memory is correct, we had some offline discussion about this change.
On Thu, Aug 27, 2020 at 09:40:22PM -0400, Daniel Jordan wrote: > I went back to your v1 post to see what motivated you originally, and you had > some results from aim9 but nothing about where this reared its head in the > first place. How did you discover the bottleneck? I'm just curious about how > lru_lock hurts in practice. I think making lru_lock per-memcg helps in a colocated environment: some workloads are of high priority while others are of low priority. For the low priority workloads, we may even want to use some swap to save memory, and this can cause frequent allocation/reclaim, depending on their working sets etc., and those allocation/reclaim paths need to hold the global lru lock and zone lock. Then, when the high priority workloads take page faults, their performance can be adversely affected, which is not acceptable since these high priority workloads normally have strict SLA requirements.
On Wed 09-09-20 10:44:32, Aaron Lu wrote: > On Thu, Aug 27, 2020 at 09:40:22PM -0400, Daniel Jordan wrote: > > I went back to your v1 post to see what motivated you originally, and you had > > some results from aim9 but nothing about where this reared its head in the > > first place. How did you discover the bottleneck? I'm just curious about how > > lru_lock hurts in practice. > > I think making lru_lock per-memcg helps in colocated environment: some > workloads are of high priority while some workloads are of low priority. > > For these low priority workloads, we may even want to use some swap for > it to save memory and this can cause frequent alloc/reclaim, depending > on its workingset etc. and these alloc/reclaim need to hold the global > lru lock and zone lock. And then when the high priority workloads do > page fault, their performance can be adversely affected and that is not > acceptible since these high priority workloads normally have strict SLA > requirement. While this all sounds reasonable, we are lacking _any_ numbers to actually make that a solid argument rather than hand waving. Having something solid is absolutely necessary for a big change like this.
Hi Hugh,

Thanks a lot for such a rich review and comments!

On 2020/9/9 7:41 AM, Hugh Dickins wrote:
> Miscellaneous Acks and NAKs and other comments on the beginning and
> the end of the series, but not much yet on the all-important middle.
> I'm hoping to be spared sending ~20 email replies to ~20 patches.
>
> [PATCH v18 01/32] mm/memcg: warning on !memcg after readahead page charged
> Acked-by: Hugh Dickins <hughd@google.com>
> if you make these changes:
>
> Please add "Add VM_WARN_ON_ONCE_PAGE() macro." or something like that to
> the commit message: that's a good addition that we shall find useful in
> other places, so please advertise it.

Accepted!

>
> Delete the four comment lines
> /* Readahead page is charged too, to see if other page uncharged */
> which make no sense on their own.
>

Accepted!

> [PATCH v18 02/32] mm/memcg: bail out early from swap accounting when memcg is disabled
> Acked-by: Hugh Dickins <hughd@google.com>
>
> [PATCH v18 03/32] mm/thp: move lru_add_page_tail func to huge_memory.c
> Acked-by: Hugh Dickins <hughd@google.com>
>
> [PATCH v18 04/32] mm/thp: clean up lru_add_page_tail
> Acked-by: Hugh Dickins <hughd@google.com>
>
> Though I'd prefer "mm/thp: use head for head page in lru_add_page_tail"
> to the unnecessarily vague "clean up". But you're right to keep this
> renaming separate from the code movement in the previous commit, and
> perhaps right to keep it from the more interesting cleanup next.
>
> [PATCH v18 05/32] mm/thp: remove code path which never got into
> This is a good simplification, but I see no sign that you understand
> why it's valid: it relies on lru_add_page_tail() being called while
> head refcount is frozen to 0: we would not get this far if someone
> else holds a reference to the THP - which they must hold if they have
> isolated the page from its lru (and that's true before or after your
> per-memcg changes - but even truer after those changes, since PageLRU
> can then be flipped without lru_lock at any instant): please explain
> something of this in the commit message.

Is the following commit log better?

    split_huge_page() is never called on a page which is not on an lru list,
    so this code path never got a chance to run, and should not be run, adding
    tail pages onto an lru list whose head page is not there.

    As Hugh Dickins mentioned, the path should never be taken, since
    lru_add_page_tail() is called while the head refcount is frozen to 0: we
    would not get this far if someone else held a reference to the THP - which
    they must hold if they have isolated the page from its lru.

    Although the bug was never triggered, it is better removed for code
    correctness, with a warning added for the unexpected case.

>
> You revisit this same code in 18/32, and I much prefer the way it looks
> after that (if (list) {} else {}) - this 05/32 is a bit weird, it would
> be easier to understand if it just did VM_WARN_ON(1). Please pull the
> 18/32 mods back into this one, maybe adding a VM_WARN_ON(PageLRU) into
> the "if (list)" block too.

Accepted.

>
> [PATCH v18 18/32] mm/thp: add tail pages into lru anyway in split_huge_page()
> Please merge into 05/32.
> But what do "Split_huge_page() must start with
> PageLRU(head)" and "Split start from PageLRU(head)" mean? Perhaps you mean
> that if list is NULL, then if the head was not on the LRU, then it cannot
> have got through page_ref_freeze(), because isolator would hold page ref?

No, what I meant is that split_huge_page() is only called on, and can only get
here with, a PageLRU(head) page. Would you suggest a wording to replace the
old one?
> That is subtle, and deserves mention in the commit comment, but is not > what you have said at all. s/unexpected/unexpectedly/. Thanks! > > [PATCH v18 06/32] mm/thp: narrow lru locking > Why? What part does this play in the series? "narrow lru locking" can > also be described as "widen page cache locking": Uh, the page cache locking isn't widen, it's still on the old place. > you are changing the > lock ordering, and not giving any reason to do so. This may be an > excellent change, or it may be a terrible change: I find that usually > lock ordering is forced upon us, and it's rare to meet an instance like > this that could go either way, and I don't know myself how to judge it. > > I do want this commit to go in, partly because it has been present in > all the testing we have done, and partly because I *can at last* see a > logical advantage to it - it also nests lru_lock inside memcg->move_lock, I must overlook sth on the lock nest. Would you like to reveal it for me? Thanks! > allowing lock_page_memcg() to be used to stabilize page->mem_cgroup when > getting per-memcg lru_lock - though only in one place, starting in v17, > do you actually use that (and, warning: it's not used correctly there). > > I'm not very bothered by how the local_irq_disable() looks to RT: THP > seems a very bad idea in an RT kernel. Earlier I asked you to run this > past Kirill and Matthew and Johannes: you did so, thank you, and Kirill > has blessed it, and no one has nacked it, and I have not noticed any > disadvantage from this change in lock ordering (documented in 23/32), > so I'm now going to say > > Acked-by: Hugh Dickins <hughd@google.com> > > But I wish you could give some reason for it in the commit message! It's a head scratch task. Would you like to tell me what's detailed info should be there? Thanks! > > Signed-off-by: Wei Yang <richard.weiyang@gmail.com> > Is that correct? Or Wei Yang suggested some part of it perhaps? Yes, we talked a lot to confirm the locking change is safe. > > [PATCH v18 07/32] mm/swap.c: stop deactivate_file_page if page not on lru > Perhaps; or perhaps by the time the pagevec is full, the page has been > drained to the lru, and it should be deactivated? I'm indifferent. > Is this important for per-memcg lru_lock? It's no much related with theme, so I'm fine to remove it. > > [PATCH v18 08/32] mm/vmscan: remove unnecessary lruvec adding > You are optimizing for a case which you then mark unlikely(), and I > don't agree that it makes the flow clearer; but you've added a useful > comment on the race there, so please s/intergrity/integrity/ in commit thanks for fixing. > message and in code comment, then > Acked-by: Hugh Dickins <hughd@google.com> > > [PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting > I strongly approve of removing the abuse of lru_lock here, but the > patch is wrong: you are mistaken in thinking the PageLRU check after > get_page_unless_zero() is an unnecessary duplicaton of the one before. > No, the one before is an optimization, and the one after is essential, > for telling whether this page (arrived at via pfn, like in compaction) > is the kind of page we understand (address_space or anon_vma or KSM > stable_node pointer in page->mapping), so can use rmap_walk() on. 
> > Please replace this patch by mine from the tarball I posted a year ago, > which keeps both checks, and justifies it against why the lru_lock was > put there in the first place - thanks to Vladimir for pointing me to > that mail thread when I tried to submit this patch a few years ago. > Appended at the end of this mail. You are right. thanks! > > [PATCH v18 10/32] mm/compaction: rename compact_deferred as compact_should_defer > I'm indifferent: I see your point about the name, but it hasn't caused > confusion in ten years, whereas changing name and tracepoint might cause > confusion. And how does changing the name help per-memcg lru_lock? It > just seems to be a random patch from your private tree. If it's Acked > by Mel who coined the name, or someone who has done a lot of work there > (Vlastimil? Joonsoo?), fine, I have no problem with it; but I don't > see what it's doing in this series - better left out. I will drop this patch. > > [PATCH v18 11/32] mm/memcg: add debug checking in lock_page_memcg > This is a very useful change for helping lockdep: > Acked-by: Hugh Dickins <hughd@google.com> > > [PATCH v18 12/32] mm/memcg: optimize mem_cgroup_page_lruvec > Hah, I see this is in my name. Well, I did once suggest folding this > into one of your patches, but it's not an optimization, and that was > before you added VM_WARN_ON_ONCE_PAGE() here. It looks strange now, > a VM_BUG_ON_PAGE() next to a VM_WARN_ON_ONCE_PAGE(); and the latter > will catch that PageTail case anyway (once). And although I feel > slightly safer with READ_ONCE(page->mem_cgroup), I'm finding it hard > to justify, doing so here but not in other places: particularly since > just above it says "This function relies on page->mem_cgroup being > stable". Let's just drop this patch. Accepted. Thanks! > > [PATCH v18 13/32] mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn > Yes, nice cleanup, I don't see why it should be different and force an > unused arg on the others. But I have one reservation: you added comment > + * > + * pagevec_move_tail_fn() must be called with IRQ disabled. > + * Otherwise this may cause nasty races. > above rotate_reclaimable_page(), having deleted pagevec_move_tail() which > had such a comment. It doesn't make sense, because pagevec_move_tail_fn() > is called with IRQ disabled anyway. That comment had better say > + * > + * rotate_reclaimable_page() must disable IRQs, to prevent nasty races. > I dimly remember hitting those nasty races many years ago, but forget > the details. Oh, one other thing, you like to use "func" as abbreviation > for "function", okay: but then at the end of the commit message you say > "no func change" - please change that to "No functional change". > Acked-by: Hugh Dickins <hughd@google.com> > Accepted. Thanks! > [PATCH v18 14/32] mm/lru: move lru_lock holding in func lru_note_cost_page > "w/o functional changes" instead of "w/o function changes". But please > just merge this into the next, 15/32: there is no point in separating them. 
> > [PATCH v18 15/32] mm/lru: move lock into lru_note_cost > [PATCH v18 16/32] mm/lru: introduce TestClearPageLRU > [PATCH v18 17/32] mm/compaction: do page isolation first in compaction > [PATCH v18 19/32] mm/swap.c: serialize memcg changes in pagevec_lru_move_fn > [PATCH v18 20/32] mm/lru: replace pgdat lru_lock with lruvec lock > [PATCH v18 21/32] mm/lru: introduce the relock_page_lruvec function > [PATCH v18 22/32] mm/vmscan: use relock for move_pages_to_lru > [PATCH v18 23/32] mm/lru: revise the comments of lru_lock > [PATCH v18 24/32] mm/pgdat: remove pgdat lru_lock > [PATCH v18 25/32] mm/mlock: remove lru_lock on TestClearPageMlocked in munlock_vma_page > [PATCH v18 26/32] mm/mlock: remove __munlock_isolate_lru_page > > I have tested, but not yet studied these, and it's a good point to break > off and send my comments so far, because 15/32 is where the cleanups end > and per-memcg lru_lock kind-of begins - lru_note_cost() being potentially > more costly, because it needs to use a different lock at each level. > (When I tried rebasing my own series a couple of months ago, I stopped > here at lru_note_cost() too, wondering if there was a better way.) > > Two things I do know about from testing, that need to be corrected: > > check_move_unevictable_pages() needs protection from page->memcg > being changed while doing the relock_page_lruvec_irq(): could use > TestClearPageLRU there (!PageLRU pages are safely skipped), but > that doubles the number of atomic ops involved. I intended to use > lock_page_memcg() instead, but that's harder than you'd expect: so > probably TestClearPageLRU will be the best to use there for now. Accepted. Thanks! > > The use of lock_page_memcg() in __munlock_pagevec() in 20/32, > introduced in patchset v17, looks good but it isn't: I was lucky that > systemd at reboot did some munlocking that exposed the problem to lockdep. > The first time into the loop, lock_page_memcg() is done before lru_lock > (as 06/32 has allowed); but the second time around the loop, it is done > while still holding lru_lock. I don't know the details of lockdep show. Just wondering could it possible to solid the move_lock/lru_lock sequence? or try other blocking way which mentioned in commit_charge()? > > lock_page_memcg() really needs to be absorbed into (a variant of) > relock_page_lruvec(), and I do have that (it's awkward because of > the different ways in which the IRQ flags are handled). And out of > curiosity, I've also tried using that in mm/swap.c too, instead of the > TestClearPageLRU technique: lockdep is happy, but an update_lru_size() > warning showed that it cannot safely be mixed with the TestClearPageLRU > technique (that I'd left in isolate_lru_page()). So I'll stash away > that relock_page_lruvec(), and consider what's best for mm/mlock.c: > now that I've posted these comments so far, that's my priority, then > to get the result under testing again, before resuming these comments. No idea of your solution, but looking forward for your good news! :) > > Jumping over 15-26, and resuming comments on recent additions: > > [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock > Could we please drop this one for the moment? And come back to it later > when the basic series is safely in. 
It's a good idea to try sorting > together those pages which come under the same lock (though my guess is > that they naturally gather themselves together quite well already); but > I'm not happy adding 360 bytes to the kernel stack here (and that in > addition to 192 bytes of horrid pseudo-vma in the shmem swapin case), > though that could be avoided by making it per-cpu. But I hope there's > a simpler way of doing it, as efficient, but also useful for the other > pagevec operations here: perhaps scanning the pagevec for same page-> > mem_cgroup (and flags node bits), NULLing entries as they are done. > Another, easily fixed, minor defect in this patch: if I'm reading it > right, it reverses the order in which the pages are put on the lru? this patch could give about 10+% performance gain on my multiple memcg readtwice testing. fairness locking cost the performance much. I also tried per cpu solution but that cause much trouble of per cpu func things, and looks no benefit except a bit struct size of stack, so if stack size still fine. May we could use the solution and improve it better. like, functionlize, fix the reverse issue etc. > > [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block > Most of this consists of replacing "locked" by "lruvec", which is good: > but please fold those changes back into 20/32 (or would it be 17/32? > I've not yet looked into the relationship between those two), so we > can then see more clearly what change this 28/32 (will need renaming!) > actually makes, to use lruvec_holds_page_lru_lock(). That may be a > good change, but it's mixed up with the "locked"->"lruvec" at present, > and I think you could have just used lruvec for locked all along > (but of course there's a place where you'll need new_lruvec too). Uh, let me rethink this. Anyway, the patch is logically different from patch 20, since it needs the new function lruvec_holds_page_lru_lock(). > > [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block > NAK. I agree that isolate_migratepages_block() looks nicer this way, but > take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is > where set_page_refcounted() changes page->_refcount from 0 to 1, allowing > a racing get_page_unless_zero() to succeed; then later prep_compound_page() > is where PageHead and PageTails get set. So there's a small race window in > which this patch could deliver a compound page when it should not. Will drop this patch. > > [PATCH v18 30/32] mm: Drop use of test_and_set_skip in favor of just setting skip > I haven't looked at this yet (but recall that per-memcg lru_lock can > change the point at which compaction should skip a contended lock: IIRC > the current kernel needs nothing extra, whereas some earlier kernels did > need extra; but when I look at 30/32, may find these remarks irrelevant). Will wait for your further comments. :) > > [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages > The title of this patch is definitely wrong: there was an explicit page > decrement there before (put_page), now it's wrapping it up inside a > WARN_ON(). We usually prefer to avoid doing functional operations > inside WARN/BUGs, but I think I'll overlook that - anyone else worried? > The comment is certainly better than what was there before: yes, this > warning reflects the difficulty we have in thinking about the > TestClearPageLRU protocol: which I'm still not sold on, but > agree we should proceed with. 
With a change in title, perhaps > "mm: add warning where TestClearPageLRU failed on freeable page"? > Acked-by: Hugh Dickins <hughd@google.com> > Accepted, thanks > [PATCH v18 32/32] mm: Split release_pages work into 3 passes > I haven't looked at this yet (but seen no problem with it in testing). > > And finally, here's my replacement (rediffed against 5.9-rc) for > [PATCH v18 09/32] mm/page_idle: no unlikely double check for idle page counting > > From: Hugh Dickins <hughd@google.com> > Date: Mon, 13 Jun 2016 19:43:34 -0700 > Subject: [PATCH] mm: page_idle_get_page() does not need lru_lock accepted, thanks! > > It is necessary for page_idle_get_page() to recheck PageLRU() after > get_page_unless_zero(), but holding lru_lock around that serves no > useful purpose, and adds to lru_lock contention: delete it. > > See https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop for the > discussion that led to lru_lock there; but __page_set_anon_rmap() now uses > WRITE_ONCE(), and I see no other risk in page_idle_clear_pte_refs() using > rmap_walk() (beyond the risk of racing PageAnon->PageKsm, mostly but not > entirely prevented by page_count() check in ksm.c's write_protect_page(): > that risk being shared with page_referenced() and not helped by lru_lock). > > Signed-off-by: Hugh Dickins <hughd@google.com> > Cc: Vladimir Davydov <vdavydov.dev@gmail.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Alex Shi <alex.shi@linux.alibaba.com> > --- > mm/page_idle.c | 4 ---- > 1 file changed, 4 deletions(-) > > --- a/mm/page_idle.c > +++ b/mm/page_idle.c > @@ -32,19 +32,15 @@ > static struct page *page_idle_get_page(unsigned long pfn) > { > struct page *page = pfn_to_online_page(pfn); > - pg_data_t *pgdat; > > if (!page || !PageLRU(page) || > !get_page_unless_zero(page)) > return NULL; > > - pgdat = page_pgdat(page); > - spin_lock_irq(&pgdat->lru_lock); > if (unlikely(!PageLRU(page))) { > put_page(page); > page = NULL; > } > - spin_unlock_irq(&pgdat->lru_lock); > return page; > } > >
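For reference, page_idle_get_page() as it reads with the above diff applied (reconstructed from the hunks; context lines assumed unchanged):

static struct page *page_idle_get_page(unsigned long pfn)
{
	struct page *page = pfn_to_online_page(pfn);

	if (!page || !PageLRU(page) ||
	    !get_page_unless_zero(page))
		return NULL;

	if (unlikely(!PageLRU(page))) {
		put_page(page);
		page = NULL;
	}
	return page;
}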
On Tue, Sep 8, 2020 at 4:41 PM Hugh Dickins <hughd@google.com> wrote: > <snip> > [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block > Most of this consists of replacing "locked" by "lruvec", which is good: > but please fold those changes back into 20/32 (or would it be 17/32? > I've not yet looked into the relationship between those two), so we > can then see more clearly what change this 28/32 (will need renaming!) > actually makes, to use lruvec_holds_page_lru_lock(). That may be a > good change, but it's mixed up with the "locked"->"lruvec" at present, > and I think you could have just used lruvec for locked all along > (but of course there's a place where you'll need new_lruvec too). I am good with my patch being folded in. No need to keep it separate. > [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block > NAK. I agree that isolate_migratepages_block() looks nicer this way, but > take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is > where set_page_refcounted() changes page->_refcount from 0 to 1, allowing > a racing get_page_unless_zero() to succeed; then later prep_compound_page() > is where PageHead and PageTails get set. So there's a small race window in > which this patch could deliver a compound page when it should not. So the main motivation for the patch was to avoid the case where we are having to reset the LRU flag. One question I would have is what if we swapped the code block with the __isolate_lru_page_prepare section? WIth that we would be taking a reference on the page, then verifying the LRU flag is set, and then testing for compound page flag bit. Would doing that close the race window since the LRU flag being set should indicate that the allocation has already been completed has it not? > [PATCH v18 30/32] mm: Drop use of test_and_set_skip in favor of just setting skip > I haven't looked at this yet (but recall that per-memcg lru_lock can > change the point at which compaction should skip a contended lock: IIRC > the current kernel needs nothing extra, whereas some earlier kernels did > need extra; but when I look at 30/32, may find these remarks irrelevant). > > [PATCH v18 31/32] mm: Add explicit page decrement in exception path for isolate_lru_pages > The title of this patch is definitely wrong: there was an explicit page > decrement there before (put_page), now it's wrapping it up inside a > WARN_ON(). We usually prefer to avoid doing functional operations > inside WARN/BUGs, but I think I'll overlook that - anyone else worried? > The comment is certainly better than what was there before: yes, this > warning reflects the difficulty we have in thinking about the > TestClearPageLRU protocol: which I'm still not sold on, but > agree we should proceed with. With a change in title, perhaps > "mm: add warning where TestClearPageLRU failed on freeable page"? > Acked-by: Hugh Dickins <hughd@google.com> I can update that and resubmit it if needed. I know there were also some suggestions from Matthew.
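A rough sketch of the reordering Alexander asks about in isolate_migratepages_block(), for discussion only (the isolate_fail_put label and surrounding details follow the series; whether this actually closes the race window is exactly the question raised above):

		/*
		 * Pin the page first, then re-check PageLRU before
		 * trusting PageCompound: the idea is that a page only
		 * becomes PageLRU after prep_new_page() has completely
		 * set it up, so the compound state seen here should be
		 * stable.
		 */
		if (unlikely(!get_page_unless_zero(page)))
			goto isolate_fail;

		if (unlikely(!PageLRU(page)))
			goto isolate_fail_put;

		if (PageCompound(page) && !cc->alloc_contig) {
			const unsigned int order = compound_order(page);

			if (likely(order < MAX_ORDER))
				low_pfn += (1UL << order) - 1;
			goto isolate_fail_put;
		}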
On Wed, 9 Sep 2020, Alex Shi wrote: > 在 2020/9/9 上午7:41, Hugh Dickins 写道: > > > > [PATCH v18 05/32] mm/thp: remove code path which never got into > > This is a good simplification, but I see no sign that you understand > > why it's valid: it relies on lru_add_page_tail() being called while > > head refcount is frozen to 0: we would not get this far if someone > > else holds a reference to the THP - which they must hold if they have > > isolated the page from its lru (and that's true before or after your > > per-memcg changes - but even truer after those changes, since PageLRU > > can then be flipped without lru_lock at any instant): please explain > > something of this in the commit message. > > Is the following commit log better? > > split_huge_page() will never call on a page which isn't on lru list, so > this code never got a chance to run, and should not be run, to add tail > pages on a lru list which head page isn't there. > > Hugh Dickins' mentioned: > The path should never be called since lru_add_page_tail() being called > while head refcount is frozen to 0: we would not get this far if someone > else holds a reference to the THP - which they must hold if they have > isolated the page from its lru. > > Although the bug was never triggered, it'better be removed for code > correctness, and add a warn for unexpected calling. Not much better, no. split_huge_page() can easily be called for a page which is not on the lru list at the time, and I don't know what was the bug which was never triggered. Stick with whatever text you end up with for the combination of 05/32 and 18/32, and I'll rewrite it after. > > [PATCH v18 06/32] mm/thp: narrow lru locking > > Why? What part does this play in the series? "narrow lru locking" can > > also be described as "widen page cache locking": > > Uh, the page cache locking isn't widen, it's still on the old place. I'm not sure if you're joking there. Perhaps just a misunderstanding. Yes, patch 06/32 does not touch the xa_lock(&mapping->i_pages) and xa_lock(&swap_cache->i_pages) lines (odd how we've arrived at two of those, but please do not get into cleaning it up now); but it removes the spin_lock_irqsave(&pgdata->lru_lock, flags) which used to come before them, and inserts a spin_lock(&pgdat->lru_lock) after them. You call that narrowing the lru locking, okay, but I see it as also pushing the page cache locking outwards: before this patch, page cache lock was taken inside lru_lock; after this patch, page cache lock is taken outside lru_lock. If you cannot see that, then I think you should not have touched this code at all; but it's what we have been testing, and I think we should go forward with it. > > But I wish you could give some reason for it in the commit message! > > It's a head scratch task. Would you like to tell me what's detailed info > should be there? Thanks! So, you don't know why you did it either: then it will be hard to justify. I guess I'll have to write something for it later. I'm strongly tempted just to drop the patch, but expect it will become useful later, for using lock_page_memcg() before getting lru_lock. > > Signed-off-by: Wei Yang <richard.weiyang@gmail.com> > > Is that correct? Or Wei Yang suggested some part of it perhaps? > > Yes, we talked a lot to confirm the locking change is safe. Okay, but the patch was written by you, and sent by you to Andrew: that is not a case for "Signed-off-by: Someone Else". > > [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock > > Could we please drop this one for the moment? 
And come back to it later > > when the basic series is safely in. It's a good idea to try sorting > > together those pages which come under the same lock (though my guess is > > that they naturally gather themselves together quite well already); but > > I'm not happy adding 360 bytes to the kernel stack here (and that in > > addition to 192 bytes of horrid pseudo-vma in the shmem swapin case), > > though that could be avoided by making it per-cpu. But I hope there's > > a simpler way of doing it, as efficient, but also useful for the other > > pagevec operations here: perhaps scanning the pagevec for same page-> > > mem_cgroup (and flags node bits), NULLing entries as they are done. > > Another, easily fixed, minor defect in this patch: if I'm reading it > > right, it reverses the order in which the pages are put on the lru? > > this patch could give about 10+% performance gain on my multiple memcg > readtwice testing. fairness locking cost the performance much. Good to know, should have been mentioned. s/fairness/Repeated/ But what was the gain or loss on your multiple memcg readtwice testing without this patch, compared against node-only lru_lock? The 80% gain mentioned before, I presume. So this further optimization can wait until the rest is solid. > > I also tried per cpu solution but that cause much trouble of per cpu func > things, and looks no benefit except a bit struct size of stack, so if > stack size still fine. May we could use the solution and improve it better. > like, functionlize, fix the reverse issue etc. I don't know how important the stack depth consideration is nowadays: I still care, maybe others don't, since VMAP_STACK became an option. Yes, please fix the reversal (if I was right on that); and I expect you could use a singly linked list instead of the double. But I'll look for an alternative - later, once the urgent stuff is completed - and leave the acks on this patch to others. Hugh
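For what it's worth, the reversal Hugh mentions can be avoided simply by walking the pagevec backwards when pushing pages onto a head-insertion (singly linked) list - a minimal sketch, with the per-lruvec batching left hypothetical:

	int i;

	/* Push in reverse index order, so pages popped off a
	 * head-insertion list reach the lru in pagevec order. */
	for (i = pagevec_count(pvec) - 1; i >= 0; i--) {
		struct page *page = pvec->pages[i];

		/* add page to its per-lruvec batch here */
	}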
On Wed, 9 Sep 2020, Alexander Duyck wrote: > On Tue, Sep 8, 2020 at 4:41 PM Hugh Dickins <hughd@google.com> wrote: > > [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block > > Most of this consists of replacing "locked" by "lruvec", which is good: > > but please fold those changes back into 20/32 (or would it be 17/32? > > I've not yet looked into the relationship between those two), so we > > can then see more clearly what change this 28/32 (will need renaming!) > > actually makes, to use lruvec_holds_page_lru_lock(). That may be a > > good change, but it's mixed up with the "locked"->"lruvec" at present, > > and I think you could have just used lruvec for locked all along > > (but of course there's a place where you'll need new_lruvec too). > > I am good with my patch being folded in. No need to keep it separate. Thanks. Though it was only the "locked"->"lruvec" changes I was suggesting to fold back, to minimize the diff, so that we could see your use of lruvec_holds_page_lru_lock() more clearly - you had not introduced that function at the stage of the earlier patches. But now that I stare at it again, using lruvec_holds_page_lru_lock() there doesn't look like an advantage to me: when it decides no, the same calculation is made all over again in mem_cgroup_page_lruvec(), whereas the code before only had to calculate it once. So, the code before looks better to me: I wonder, do you think that rcu_read_lock() is more expensive than I think it? There can be debug instrumentation that makes it heavier, but by itself it is very cheap (by design) - not worth branching around. > > > [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block > > NAK. I agree that isolate_migratepages_block() looks nicer this way, but > > take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is > > where set_page_refcounted() changes page->_refcount from 0 to 1, allowing > > a racing get_page_unless_zero() to succeed; then later prep_compound_page() > > is where PageHead and PageTails get set. So there's a small race window in > > which this patch could deliver a compound page when it should not. > > So the main motivation for the patch was to avoid the case where we > are having to reset the LRU flag. That would be satisfying. Not necessary, but I agree satisfying. Maybe depends also on your "skip" change, which I've not looked at yet? > One question I would have is what if > we swapped the code block with the __isolate_lru_page_prepare section? > WIth that we would be taking a reference on the page, then verifying > the LRU flag is set, and then testing for compound page flag bit. > Would doing that close the race window since the LRU flag being set > should indicate that the allocation has already been completed has it > not? Yes, I think that would be safe, and would look better. But I am very hesitant to give snap assurances here (I've twice missed out a vital PageLRU check from this sequence myself): it is very easy to deceive myself and only see it later. If you can see a bug in what's there before these patches, certainly we need to fix it. But adding non-essential patches to the already overlong series risks delaying it. Hugh
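For readers following along: lruvec_holds_page_lru_lock() compares the page's pgdat and memcg against what the lruvec records, without taking rcu_read_lock(). Sketched here from memory of the posted series, so details may differ:

static inline bool lruvec_holds_page_lru_lock(struct page *page,
					      struct lruvec *lruvec)
{
	pg_data_t *pgdat = page_pgdat(page);
	const struct mem_cgroup *memcg;
	struct mem_cgroup_per_node *mz;

	if (mem_cgroup_disabled())
		return lruvec == &pgdat->__lruvec;

	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
	memcg = page->mem_cgroup ? : root_mem_cgroup;

	return lruvec->pgdat == pgdat && mz->memcg == memcg;
}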
On Wed, Sep 9, 2020 at 5:32 PM Hugh Dickins <hughd@google.com> wrote: > > On Wed, 9 Sep 2020, Alexander Duyck wrote: > > On Tue, Sep 8, 2020 at 4:41 PM Hugh Dickins <hughd@google.com> wrote: > > > [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block > > > Most of this consists of replacing "locked" by "lruvec", which is good: > > > but please fold those changes back into 20/32 (or would it be 17/32? > > > I've not yet looked into the relationship between those two), so we > > > can then see more clearly what change this 28/32 (will need renaming!) > > > actually makes, to use lruvec_holds_page_lru_lock(). That may be a > > > good change, but it's mixed up with the "locked"->"lruvec" at present, > > > and I think you could have just used lruvec for locked all along > > > (but of course there's a place where you'll need new_lruvec too). > > > > I am good with my patch being folded in. No need to keep it separate. > > Thanks. Though it was only the "locked"->"lruvec" changes I was > suggesting to fold back, to minimize the diff, so that we could > see your use of lruvec_holds_page_lru_lock() more clearly - you > had not introduced that function at the stage of the earlier patches. > > But now that I stare at it again, using lruvec_holds_page_lru_lock() > there doesn't look like an advantage to me: when it decides no, the > same calculation is made all over again in mem_cgroup_page_lruvec(), > whereas the code before only had to calculate it once. > > So, the code before looks better to me: I wonder, do you think that > rcu_read_lock() is more expensive than I think it? There can be > debug instrumentation that makes it heavier, but by itself it is > very cheap (by design) - not worth branching around. Actually what I was more concerned with was the pointer chase that required the RCU lock. With this function we are able to compare a pair of pointers from the page and the lruvec and avoid the need for the RCU lock. The way the old code was working we had to crawl through the memcg to get to the lruvec before we could compare it to the one we currently hold. The general idea is to use the data we have instead of having to pull in some additional cache lines to perform the test. > > > > > [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block > > > NAK. I agree that isolate_migratepages_block() looks nicer this way, but > > > take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is > > > where set_page_refcounted() changes page->_refcount from 0 to 1, allowing > > > a racing get_page_unless_zero() to succeed; then later prep_compound_page() > > > is where PageHead and PageTails get set. So there's a small race window in > > > which this patch could deliver a compound page when it should not. > > > > So the main motivation for the patch was to avoid the case where we > > are having to reset the LRU flag. > > That would be satisfying. Not necessary, but I agree satisfying. > Maybe depends also on your "skip" change, which I've not looked at yet? My concern is that we have scenarios where isolate_migratepages_block could possibly prevent another page from being able to isolate a page. I'm mostly concerned with us potentially creating something like an isolation leak if multiple threads are doing something like clearing and then resetting the LRU flag. In my mind if we clear the LRU flag we should be certain we are going to remove the page as otherwise another thread would have done it if it would have been allowed access. 
> > One question I would have is what if > > we swapped the code block with the __isolate_lru_page_prepare section? > > WIth that we would be taking a reference on the page, then verifying > > the LRU flag is set, and then testing for compound page flag bit. > > Would doing that close the race window since the LRU flag being set > > should indicate that the allocation has already been completed has it > > not? > > Yes, I think that would be safe, and would look better. But I am > very hesitant to give snap assurances here (I've twice missed out > a vital PageLRU check from this sequence myself): it is very easy > to deceive myself and only see it later. I'm not looking for assurances, just sanity checks to make sure I am not missing something obvious. > If you can see a bug in what's there before these patches, certainly > we need to fix it. But adding non-essential patches to the already > overlong series risks delaying it. My concern ends up being that if we are clearing the bit and restoring it while holding the LRU lock we can effectively cause pages to become pseudo-pinned on the LRU. In my mind I would want us to avoid clearing the LRU flag until we know we are going to be pulling the page from the list once we take the lruvec lock. I interpret clearing of the flag to indicate the page has already been pulled, it just hasn't left the list yet. With us resetting the bit we are violating that which I worry will lead to issues.
在 2020/9/10 上午7:16, Hugh Dickins 写道: > On Wed, 9 Sep 2020, Alex Shi wrote: >> 在 2020/9/9 上午7:41, Hugh Dickins 写道: >>> >>> [PATCH v18 05/32] mm/thp: remove code path which never got into >>> This is a good simplification, but I see no sign that you understand >>> why it's valid: it relies on lru_add_page_tail() being called while >>> head refcount is frozen to 0: we would not get this far if someone >>> else holds a reference to the THP - which they must hold if they have >>> isolated the page from its lru (and that's true before or after your >>> per-memcg changes - but even truer after those changes, since PageLRU >>> can then be flipped without lru_lock at any instant): please explain >>> something of this in the commit message. >> >> Is the following commit log better? >> >> split_huge_page() will never call on a page which isn't on lru list, so >> this code never got a chance to run, and should not be run, to add tail >> pages on a lru list which head page isn't there. >> >> Hugh Dickins' mentioned: >> The path should never be called since lru_add_page_tail() being called >> while head refcount is frozen to 0: we would not get this far if someone >> else holds a reference to the THP - which they must hold if they have >> isolated the page from its lru. >> >> Although the bug was never triggered, it'better be removed for code >> correctness, and add a warn for unexpected calling. > > Not much better, no. split_huge_page() can easily be called for a page > which is not on the lru list at the time, Hi Hugh, Thanks for comments! There are some discussion on this point a couple of weeks ago, https://lkml.org/lkml/2020/7/9/760 Matthew Wilcox and Kirill have the following comments, > I don't understand how we get to split_huge_page() with a page that's > not on an LRU list. Both anonymous and page cache pages should be on > an LRU list. What am I missing? Right, and it's never got removed from LRU during the split. The tail pages have to be added to LRU because they now separate from the tail page.
On Fri, 11 Sep 2020, Alex Shi wrote: > 在 2020/9/10 上午7:16, Hugh Dickins 写道: > > On Wed, 9 Sep 2020, Alex Shi wrote: > >> 在 2020/9/9 上午7:41, Hugh Dickins 写道: > >>> > >>> [PATCH v18 05/32] mm/thp: remove code path which never got into > >>> This is a good simplification, but I see no sign that you understand > >>> why it's valid: it relies on lru_add_page_tail() being called while > >>> head refcount is frozen to 0: we would not get this far if someone > >>> else holds a reference to the THP - which they must hold if they have > >>> isolated the page from its lru (and that's true before or after your > >>> per-memcg changes - but even truer after those changes, since PageLRU > >>> can then be flipped without lru_lock at any instant): please explain > >>> something of this in the commit message. > >> > >> Is the following commit log better? > >> > >> split_huge_page() will never call on a page which isn't on lru list, so > >> this code never got a chance to run, and should not be run, to add tail > >> pages on a lru list which head page isn't there. > >> > >> Hugh Dickins' mentioned: > >> The path should never be called since lru_add_page_tail() being called > >> while head refcount is frozen to 0: we would not get this far if someone > >> else holds a reference to the THP - which they must hold if they have > >> isolated the page from its lru. > >> > >> Although the bug was never triggered, it'better be removed for code > >> correctness, and add a warn for unexpected calling. > > > > Not much better, no. split_huge_page() can easily be called for a page > > which is not on the lru list at the time, > > Hi Hugh, > > Thanks for comments! > > There are some discussion on this point a couple of weeks ago, > https://lkml.org/lkml/2020/7/9/760 > > Matthew Wilcox and Kirill have the following comments, > > I don't understand how we get to split_huge_page() with a page that's > > not on an LRU list. Both anonymous and page cache pages should be on > > an LRU list. What am I missing? > > Right, and it's never got removed from LRU during the split. The tail > pages have to be added to LRU because they now separate from the tail > page. > > -- > Kirill A. Shutemov Yes, those were among the mails that I read through before getting down to review. I was surprised by their not understanding, but it was a bit late to reply to that thread. Perhaps everybody had been focused on pages which have been and naturally belong on an LRU list, rather than pages which are on the LRU list at the instant that split_huge_page() is called. There are a number of places where PageLRU gets cleared, and a number of places where we del_page_from_lru_list(), I think you'll agree: your patches touch all or most of them. Let's think of a common one, isolate_lru_pages() used by page reclaim, but the same would apply to most of the others. Then there a number of places where split_huge_page() is called: I am having difficulty finding any of those which cannot race with page reclaim, but shall we choose anon THP's deferred_split_scan(), or shmem THP's shmem_punch_compound()? What prevents either of those from calling split_huge_page() at a time when isolate_lru_pages() has removed the page from LRU? 
But there's no problem in this race, because anyone isolating the page from LRU must hold their own reference to the page (to prevent it from being freed independently), and the can_split_huge_page() or page_ref_freeze() in split_huge_page_to_list() will detect that and fail the split with -EBUSY (or else succeed and prevent new references from being acquired). So this case never reaches lru_add_page_tail(). > > > and I don't know what was the > > bug which was never triggered. > > So the only path to the removed part should be a bug, like sth here, > https://lkml.org/lkml/2020/7/10/118 > or > https://lkml.org/lkml/2020/7/10/972 Oh, the use of split_huge_page() in __iommu_dma_alloc_pages() is just nonsense, I thought it had already been removed - perhaps some debate over __GFP_COMP held it up. Not something you need worry about in this patchset. > > > Stick with whatever text you end up with > > for the combination of 05/32 and 18/32, and I'll rewrite it after. > > I am not object to merge them into one, I just don't know how to say > clear about 2 patches in commit log. As patch 18, TestClearPageLRU > add the incorrect posibility of remove lru bit during split, that's > the reason of code path rewrite and a WARN there. I did not know that was why you were putting 18/32 in at that point, it does not mention TestClearPageLRU at all. But the fact remains that it's a nice cleanup, contains a reassuring WARN if we got it wrong (and I've suggested a WARN on the other branch too), it was valid before your changes, and it's valid after your changes. Please merge it back into the uglier 05/32, and again I'll rewrite whatever comment you come up with if necessary. > > > >>> [PATCH v18 06/32] mm/thp: narrow lru locking > >>> Why? What part does this play in the series? "narrow lru locking" can > >>> also be described as "widen page cache locking": > >> > >> Uh, the page cache locking isn't widen, it's still on the old place. > > > > I'm not sure if you're joking there. Perhaps just a misunderstanding. > > > > Yes, patch 06/32 does not touch the xa_lock(&mapping->i_pages) and > > xa_lock(&swap_cache->i_pages) lines (odd how we've arrived at two of > > those, but please do not get into cleaning it up now); but it removes > > the spin_lock_irqsave(&pgdata->lru_lock, flags) which used to come > > before them, and inserts a spin_lock(&pgdat->lru_lock) after them. > > > > You call that narrowing the lru locking, okay, but I see it as also > > pushing the page cache locking outwards: before this patch, page cache > > lock was taken inside lru_lock; after this patch, page cache lock is > > taken outside lru_lock. If you cannot see that, then I think you > > should not have touched this code at all; but it's what we have > > been testing, and I think we should go forward with it. > > > >>> But I wish you could give some reason for it in the commit message! > >> > >> It's a head scratch task. Would you like to tell me what's detailed info > >> should be there? Thanks! > > > > So, you don't know why you did it either: then it will be hard to > > justify. I guess I'll have to write something for it later. I'm > > strongly tempted just to drop the patch, but expect it will become > > useful later, for using lock_page_memcg() before getting lru_lock. > > > > I thought the xa_lock and lru_lock relationship was described clear > in the commit log, You say "lru_lock and page cache xa_lock have no reason with current sequence", but you give no reason for inverting their sequence: "let's" is not a reason. 
> and still no idea of the move_lock in the chain. memcg->move_lock is what's at the heart of lock_page_memcg(), but as much as possible that tries to avoid the overhead of actually taking it, since moving memcg is a rare operation. For lock ordering, see the diagram in mm/rmap.c, which 23/32 updates to match this change. Before this commit: lru_lock > move_lock > i_pages lock was the expected lock ordering (but it looks as if the lru_lock > move_lock requirement came from my per-memcg lru_lock patches). After this commit: move_lock > i_pages lock > lru_lock is the required lock ordering, since there are strong reasons (in dirty writeback) for move_lock > i_pages lock. > Please refill them for what I overlooked. Will do, but not before reviewing your remaining patches. > Thanks! > > >>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com> > >>> Is that correct? Or Wei Yang suggested some part of it perhaps? > >> > >> Yes, we talked a lot to confirm the locking change is safe. > > > > Okay, but the patch was written by you, and sent by you to Andrew: > > that is not a case for "Signed-off-by: Someone Else". > > > > Ok. let's remove his signed-off. > > >>> [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock > >>> Could we please drop this one for the moment? And come back to it later > >>> when the basic series is safely in. It's a good idea to try sorting > >>> together those pages which come under the same lock (though my guess is > >>> that they naturally gather themselves together quite well already); but > >>> I'm not happy adding 360 bytes to the kernel stack here (and that in > >>> addition to 192 bytes of horrid pseudo-vma in the shmem swapin case), > >>> though that could be avoided by making it per-cpu. But I hope there's > >>> a simpler way of doing it, as efficient, but also useful for the other > >>> pagevec operations here: perhaps scanning the pagevec for same page-> > >>> mem_cgroup (and flags node bits), NULLing entries as they are done. > >>> Another, easily fixed, minor defect in this patch: if I'm reading it > >>> right, it reverses the order in which the pages are put on the lru? > >> > >> this patch could give about 10+% performance gain on my multiple memcg > >> readtwice testing. fairness locking cost the performance much. > > > > Good to know, should have been mentioned. s/fairness/Repeated/ > > > > But what was the gain or loss on your multiple memcg readtwice > > testing without this patch, compared against node-only lru_lock? > > The 80% gain mentioned before, I presume. So this further > > optimization can wait until the rest is solid. > > the gain based on the patch 26. If I understand your brief comment there, you're saying that in a fixed interval of time, the baseline 5.9-rc did 100 runs, the patches up to and including 26/32 did 180 runs, then with 27/32 on top, did 198 runs? That's a good improvement by 27/32, but not essential for getting the patchset in: I don't think 27/32 is the right way to do it, so I'd still prefer to hold it back from the "initial offering". > > > > >> > >> I also tried per cpu solution but that cause much trouble of per cpu func > >> things, and looks no benefit except a bit struct size of stack, so if > >> stack size still fine. May we could use the solution and improve it better. > >> like, functionlize, fix the reverse issue etc. > > > > I don't know how important the stack depth consideration is nowadays: > > I still care, maybe others don't, since VMAP_STACK became an option. 
> > > > Yes, please fix the reversal (if I was right on that); and I expect > > you could use a singly linked list instead of the double. > > single linked list is more saving, but do we have to reverse walking to seek > the head or tail for correct sequence? I imagine all you need is to start off with a for (i = pagevec_count(pvec) - 1; i >= 0; i--) loop. > > > > > But I'll look for an alternative - later, once the urgent stuff > > is completed - and leave the acks on this patch to others. > > Ok, looking forward for your new solution! > > Thanks > Alex
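Restating the lock ordering Hugh spells out above as a diagram, in the style of the one in mm/rmap.c (schematic only; 23/32 carries the authoritative version):

 * Before 06/32:
 *   pgdat->lru_lock
 *     memcg->move_lock          (lock_page_memcg)
 *       mapping->i_pages lock
 *
 * After 06/32:
 *   memcg->move_lock            (lock_page_memcg)
 *     mapping->i_pages lock
 *       lruvec->lru_lock        (per-memcg lru_lock, from 20/32)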
On Thu, 10 Sep 2020, Alexander Duyck wrote: > On Wed, Sep 9, 2020 at 5:32 PM Hugh Dickins <hughd@google.com> wrote: > > On Wed, 9 Sep 2020, Alexander Duyck wrote: > > > On Tue, Sep 8, 2020 at 4:41 PM Hugh Dickins <hughd@google.com> wrote: > > > > [PATCH v18 28/32] mm/compaction: Drop locked from isolate_migratepages_block > > > > Most of this consists of replacing "locked" by "lruvec", which is good: > > > > but please fold those changes back into 20/32 (or would it be 17/32? > > > > I've not yet looked into the relationship between those two), so we > > > > can then see more clearly what change this 28/32 (will need renaming!) > > > > actually makes, to use lruvec_holds_page_lru_lock(). That may be a > > > > good change, but it's mixed up with the "locked"->"lruvec" at present, > > > > and I think you could have just used lruvec for locked all along > > > > (but of course there's a place where you'll need new_lruvec too). > > > > > > I am good with my patch being folded in. No need to keep it separate. > > > > Thanks. Though it was only the "locked"->"lruvec" changes I was > > suggesting to fold back, to minimize the diff, so that we could > > see your use of lruvec_holds_page_lru_lock() more clearly - you > > had not introduced that function at the stage of the earlier patches. > > > > But now that I stare at it again, using lruvec_holds_page_lru_lock() > > there doesn't look like an advantage to me: when it decides no, the > > same calculation is made all over again in mem_cgroup_page_lruvec(), > > whereas the code before only had to calculate it once. > > > > So, the code before looks better to me: I wonder, do you think that > > rcu_read_lock() is more expensive than I think it? There can be > > debug instrumentation that makes it heavier, but by itself it is > > very cheap (by design) - not worth branching around. > > Actually what I was more concerned with was the pointer chase that > required the RCU lock. With this function we are able to compare a > pair of pointers from the page and the lruvec and avoid the need for > the RCU lock. The way the old code was working we had to crawl through > the memcg to get to the lruvec before we could compare it to the one > we currently hold. The general idea is to use the data we have instead > of having to pull in some additional cache lines to perform the test. When you say "With this function...", I think you are referring to lruvec_holds_page_lru_lock(). Yes, I appreciate what you're doing there, making calculations from known-stable data, and taking it no further than the required comparison; and I think (I don't yet claim to have reviewed 21/32) what you do with it in relock_page_lruvec*() is an improvement over what we had there before. But here I'm talking about using it in isolate_migratepages_block() in 28/32: in this case, the code before evaluated the new lruvec, compared against the old, and immediately used the new lruvec if different; whereas using lruvec_holds_page_lru_lock() makes an almost (I agree not entirely, and I haven't counted cachelines) equivalent evaluation, but its results have to thrown away when it's false, then the new lruvec actually calculated and used. The same "results thrown away" criticism can be made of relock_page_lruvec*(), but what was done there before your rewrite in v18 was no better: they both resort to lock_page_lruvec*(page), working it all out again from page. 
And I'm not suggesting that be changed, not at this point anyway; but 28/32 looks to me like a regression from what was done there before 28/32. > > > > > > > > [PATCH v18 29/32] mm: Identify compound pages sooner in isolate_migratepages_block > > > > NAK. I agree that isolate_migratepages_block() looks nicer this way, but > > > > take a look at prep_new_page() in mm/page_alloc.c: post_alloc_hook() is > > > > where set_page_refcounted() changes page->_refcount from 0 to 1, allowing > > > > a racing get_page_unless_zero() to succeed; then later prep_compound_page() > > > > is where PageHead and PageTails get set. So there's a small race window in > > > > which this patch could deliver a compound page when it should not. > > > > > > So the main motivation for the patch was to avoid the case where we > > > are having to reset the LRU flag. > > > > That would be satisfying. Not necessary, but I agree satisfying. > > Maybe depends also on your "skip" change, which I've not looked at yet? > > My concern is that we have scenarios where isolate_migratepages_block > could possibly prevent another page from being able to isolate a page. > I'm mostly concerned with us potentially creating something like an > isolation leak if multiple threads are doing something like clearing > and then resetting the LRU flag. In my mind if we clear the LRU flag > we should be certain we are going to remove the page as otherwise > another thread would have done it if it would have been allowed > access. I agree it's nicer not to TestClearPageLRU unnecessarily; but if the occasional unnecessary TestClearPageLRU were really a concern, then there's a lot of more serious places to worry about - page reclaim is the great isolator that comes first to my mind. > > > > One question I would have is what if > > > we swapped the code block with the __isolate_lru_page_prepare section? > > > WIth that we would be taking a reference on the page, then verifying > > > the LRU flag is set, and then testing for compound page flag bit. > > > Would doing that close the race window since the LRU flag being set > > > should indicate that the allocation has already been completed has it > > > not? > > > > Yes, I think that would be safe, and would look better. But I am > > very hesitant to give snap assurances here (I've twice missed out > > a vital PageLRU check from this sequence myself): it is very easy > > to deceive myself and only see it later. > > I'm not looking for assurances, just sanity checks to make sure I am > not missing something obvious. > > > If you can see a bug in what's there before these patches, certainly > > we need to fix it. But adding non-essential patches to the already > > overlong series risks delaying it. > > My concern ends up being that if we are clearing the bit and restoring > it while holding the LRU lock we can effectively cause pages to become > pseudo-pinned on the LRU. In my mind I would want us to avoid clearing > the LRU flag until we know we are going to be pulling the page from > the list once we take the lruvec lock. I interpret clearing of the > flag to indicate the page has already been pulled, it just hasn't left > the list yet. With us resetting the bit we are violating that which I > worry will lead to issues. Your concern and my concern are different, but we are "on the same page". I've said repeatedly (to Alex) that I am not at ease with this TestClearPageLRU() technique: he has got it working, reliably, but I find it hard to reason about. 
Perhaps I'm just too used to what was there before, but clearing PageLRU and removing from LRU while holding lru_lock seems natural to me; whereas disconnecting them leaves us on shaky ground, adding comments and warnings about the peculiar races involved. And it adds a pair of atomic operations on each page in pagevec_lru_move_fn(), which were not needed before. I want us to go ahead with TestClearPageLRU, to get the series into mmotm and under wider testing. But if we accept the lock reordering in 06/32, then it becomes possible to replace those TestClearPageLRUs by lock_page_memcg()s: which in principle should be cheaper, but that will have to be measured. Hugh
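For context in the discussion above, the relock helper that 21/32 introduces (and that a lock_page_memcg() variant would extend) has roughly this shape - sketched from the series, details may differ:

static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
		struct lruvec *locked_lruvec)
{
	if (locked_lruvec) {
		/* Already holding the lru_lock this page maps to? */
		if (lruvec_holds_page_lru_lock(page, locked_lruvec))
			return locked_lruvec;

		unlock_page_lruvec_irq(locked_lruvec);
	}

	return lock_page_lruvec_irq(page);
}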
On Wed, 9 Sep 2020, Alex Shi wrote: > 在 2020/9/9 上午7:41, Hugh Dickins 写道: > > > > The use of lock_page_memcg() in __munlock_pagevec() in 20/32, > > introduced in patchset v17, looks good but it isn't: I was lucky that > > systemd at reboot did some munlocking that exposed the problem to lockdep. > > The first time into the loop, lock_page_memcg() is done before lru_lock > > (as 06/32 has allowed); but the second time around the loop, it is done > > while still holding lru_lock. > > I don't know the details of lockdep show. Just wondering could it possible > to solid the move_lock/lru_lock sequence? > or try other blocking way which mentioned in commit_charge()? > > > > > lock_page_memcg() really needs to be absorbed into (a variant of) > > relock_page_lruvec(), and I do have that (it's awkward because of > > the different ways in which the IRQ flags are handled). And out of > > curiosity, I've also tried using that in mm/swap.c too, instead of the > > TestClearPageLRU technique: lockdep is happy, but an update_lru_size() > > warning showed that it cannot safely be mixed with the TestClearPageLRU > > technique (that I'd left in isolate_lru_page()). So I'll stash away > > that relock_page_lruvec(), and consider what's best for mm/mlock.c: > > now that I've posted these comments so far, that's my priority, then > > to get the result under testing again, before resuming these comments. > > No idea of your solution, but looking forward for your good news! :) Yes, it is good news, and simpler than anything suggested above. The main difficulties will probably be to look good in the 80 columns (I know that limit has been lifted recently, but some of us use xterms side by side), and to explain it. mm/mlock.c has not been kept up-to-date very well: and in particular, you have taken too seriously that "Serialize with any parallel __split_huge_page_refcount()..." comment that you updated to two comments "Serialize split tail pages in __split_huge_page_tail()...". Delete them! The original comment was by Vlastimil for v3.14 in 2014. But Kirill redesigned THP refcounting for v4.5 in 2016: that's when __split_huge_page_refcount() went away. And with the new refcounting, the THP splitting races that lru_lock protected munlock_vma_page() and __munlock_pagevec() from: those races have become impossible. Or maybe there never was such a race in __munlock_pagevec(): you have added the comment there, assuming lru_lock was for that purpose, but that was probably just the convenient place to take it, to cover all the del_page_from_lru()s. Observe how split_huge_page_to_list() uses unmap_page() to remove all pmds and all ptes for the huge page being split, and remap_page() only replaces the migration entries (used for anon but not for shmem or file) after doing all of the __split_huge_page_tail()s, before unlocking any of the pages. Recall that munlock_vma_page() and __munlock_pagevec() are being applied to pages found mapped into userspace, by ptes or pmd: there are none of those while __split_huge_page_tail() is being used, so no race to protect from. (Could a newly detached tail be freshly faulted into userspace just before __split_huge_page() has reached the head? Not quite, the fault has to wait to get the tail's page lock. But even if it could, how would that be a problem for __munlock_pagevec()?) There's lots more that could be said: for example, PageMlocked will always be clear on the THP head during __split_huge_page_tail(), because the last unmap of a PageMlocked page does clear_page_mlock(). 
But that's not required to prove the case, it's just another argument against the "Serialize" comment you have in __munlock_pagevec(). So, no need for the problematic lock_page_memcg(page) there in __munlock_pagevec(), nor to lock (or relock) lruvec just below it. __munlock_pagevec() still needs lru_lock to del_page_from_lru_list(), of course, but that must be done after your TestClearPageMlocked has stabilized page->memcg. Use relock_page_lruvec_irq() here? I suppose that will be easiest, but notice how __munlock_pagevec_fill() has already made sure that all the pages in the pagevec are from the same zone (and it cannot do the same for memcg without locking page memcg); so some of relock's work will be redundant. Otherwise, I'm much happier with your mm/mlock.c since looking at it in more detail: a couple of nits though - drop the clear_page_mlock() hunk from 25/32 - kernel style says do it the way you are undoing by - if (!isolate_lru_page(page)) { + if (!isolate_lru_page(page)) putback_lru_page(page); - } else { + else { I don't always follow that over-braced style when making changes, but you should not touch otherwise untouched code just to make it go against the approved style. And in munlock_vma_page(), - if (!TestClearPageMlocked(page)) { + if (!TestClearPageMlocked(page)) /* Potentially, PTE-mapped THP: do not skip the rest PTEs */ - nr_pages = 1; - goto unlock_out; - } + return 0; please restore the braces: with that comment line in there, the compiler does not need the braces, but the human eye does. Hugh
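Concretely, the shape being asked for at the top of munlock_vma_page() is just (sketch):

	if (!TestClearPageMlocked(page)) {
		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
		return 0;
	}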
在 2020/9/12 上午10:13, Hugh Dickins 写道: > On Fri, 11 Sep 2020, Alex Shi wrote: >> 在 2020/9/10 上午7:16, Hugh Dickins 写道: >>> On Wed, 9 Sep 2020, Alex Shi wrote: >>>> 在 2020/9/9 上午7:41, Hugh Dickins 写道: >>>>> >>>>> [PATCH v18 05/32] mm/thp: remove code path which never got into >>>>> This is a good simplification, but I see no sign that you understand >>>>> why it's valid: it relies on lru_add_page_tail() being called while >>>>> head refcount is frozen to 0: we would not get this far if someone >>>>> else holds a reference to the THP - which they must hold if they have >>>>> isolated the page from its lru (and that's true before or after your >>>>> per-memcg changes - but even truer after those changes, since PageLRU >>>>> can then be flipped without lru_lock at any instant): please explain >>>>> something of this in the commit message. >>>> >>>> Is the following commit log better? >>>> >>>> split_huge_page() will never call on a page which isn't on lru list, so >>>> this code never got a chance to run, and should not be run, to add tail >>>> pages on a lru list which head page isn't there. >>>> >>>> Hugh Dickins' mentioned: >>>> The path should never be called since lru_add_page_tail() being called >>>> while head refcount is frozen to 0: we would not get this far if someone >>>> else holds a reference to the THP - which they must hold if they have >>>> isolated the page from its lru. >>>> >>>> Although the bug was never triggered, it'better be removed for code >>>> correctness, and add a warn for unexpected calling. >>> >>> Not much better, no. split_huge_page() can easily be called for a page >>> which is not on the lru list at the time, >> >> Hi Hugh, >> >> Thanks for comments! >> >> There are some discussion on this point a couple of weeks ago, >> https://lkml.org/lkml/2020/7/9/760 >> >> Matthew Wilcox and Kirill have the following comments, >>> I don't understand how we get to split_huge_page() with a page that's >>> not on an LRU list. Both anonymous and page cache pages should be on >>> an LRU list. What am I missing? >> >> Right, and it's never got removed from LRU during the split. The tail >> pages have to be added to LRU because they now separate from the tail >> page. >> >> -- >> Kirill A. Shutemov > > Yes, those were among the mails that I read through before getting > down to review. I was surprised by their not understanding, but > it was a bit late to reply to that thread. > > Perhaps everybody had been focused on pages which have been and > naturally belong on an LRU list, rather than pages which are on > the LRU list at the instant that split_huge_page() is called. > > There are a number of places where PageLRU gets cleared, and a > number of places where we del_page_from_lru_list(), I think you'll > agree: your patches touch all or most of them. Let's think of a > common one, isolate_lru_pages() used by page reclaim, but the same > would apply to most of the others. > > Then there a number of places where split_huge_page() is called: > I am having difficulty finding any of those which cannot race with > page reclaim, but shall we choose anon THP's deferred_split_scan(), > or shmem THP's shmem_punch_compound()? > > What prevents either of those from calling split_huge_page() at > a time when isolate_lru_pages() has removed the page from LRU? 
> 
> But there's no problem in this race, because anyone isolating the
> page from LRU must hold their own reference to the page (to prevent
> it from being freed independently), and the can_split_huge_page() or
> page_ref_freeze() in split_huge_page_to_list() will detect that and
> fail the split with -EBUSY (or else succeed and prevent new references
> from being acquired). So this case never reaches lru_add_page_tail().

Hi Hugh,

Thanks for the comments!

We are on the same page here: we both know split_huge_page_to_list()
blocks them from going further, and the code is functionally right. If the
comments 'Split start from PageLRU(head), and ...' don't make things as
clear as they should be, I am glad to see you rewrite and improve them.

> 
>> 
>>> and I don't know what was the
>>> bug which was never triggered.
>> 
>> So the only path to the removed part should be a bug, like sth here,
>> https://lkml.org/lkml/2020/7/10/118
>> or
>> https://lkml.org/lkml/2020/7/10/972
> 
> Oh, the use of split_huge_page() in __iommu_dma_alloc_pages() is just
> nonsense, I thought it had already been removed - perhaps some debate
> over __GFP_COMP held it up. Not something you need worry about in
> this patchset.
> 
>> 
>>> Stick with whatever text you end up with
>>> for the combination of 05/32 and 18/32, and I'll rewrite it after.
>> 
>> I am not object to merge them into one, I just don't know how to say
>> clear about 2 patches in commit log. As patch 18, TestClearPageLRU
>> add the incorrect posibility of remove lru bit during split, that's
>> the reason of code path rewrite and a WARN there.
> 
> I did not know that was why you were putting 18/32 in at that
> point, it does not mention TestClearPageLRU at all. But the fact
> remains that it's a nice cleanup, contains a reassuring WARN if we
> got it wrong (and I've suggested a WARN on the other branch too),
> it was valid before your changes, and it's valid after your changes.
> Please merge it back into the uglier 05/32, and again I'll rewrite
> whatever comment you come up with if necessary.

I merged them together on the following git branch, and will leave the
commit log to you. :)
https://github.com/alexshi/linux.git lruv19

> 
>>> 
>>>>> [PATCH v18 06/32] mm/thp: narrow lru locking
>>>>> Why? What part does this play in the series? "narrow lru locking" can
>>>>> also be described as "widen page cache locking":
>>>> 
>>>> Uh, the page cache locking isn't widen, it's still on the old place.
>>> 
>>> I'm not sure if you're joking there. Perhaps just a misunderstanding.
>>> 
>>> Yes, patch 06/32 does not touch the xa_lock(&mapping->i_pages) and
>>> xa_lock(&swap_cache->i_pages) lines (odd how we've arrived at two of
>>> those, but please do not get into cleaning it up now); but it removes
>>> the spin_lock_irqsave(&pgdata->lru_lock, flags) which used to come
>>> before them, and inserts a spin_lock(&pgdat->lru_lock) after them.
>>> 
>>> You call that narrowing the lru locking, okay, but I see it as also
>>> pushing the page cache locking outwards: before this patch, page cache
>>> lock was taken inside lru_lock; after this patch, page cache lock is
>>> taken outside lru_lock. If you cannot see that, then I think you
>>> should not have touched this code at all; but it's what we have
>>> been testing, and I think we should go forward with it.
>>> 
>>>>> But I wish you could give some reason for it in the commit message!
>>>> 
>>>> It's a head scratch task. Would you like to tell me what's detailed info
>>>> should be there? Thanks!
>>> 
>>> So, you don't know why you did it either: then it will be hard to
>>> justify. I guess I'll have to write something for it later. I'm
>>> strongly tempted just to drop the patch, but expect it will become
>>> useful later, for using lock_page_memcg() before getting lru_lock.
>>> 
>> 
>> I thought the xa_lock and lru_lock relationship was described clear
>> in the commit log,
> 
> You say "lru_lock and page cache xa_lock have no reason with current
> sequence", but you give no reason for inverting their sequence:
> "let's" is not a reason.
> 
>> and still no idea of the move_lock in the chain.
> 
> memcg->move_lock is what's at the heart of lock_page_memcg(), but
> as much as possible that tries to avoid the overhead of actually
> taking it, since moving memcg is a rare operation. For lock ordering,
> see the diagram in mm/rmap.c, which 23/32 updates to match this change.

I see, thanks!

> 
> Before this commit: lru_lock > move_lock > i_pages lock was the
> expected lock ordering (but it looks as if the lru_lock > move_lock
> requirement came from my per-memcg lru_lock patches).
> 
> After this commit: move_lock > i_pages lock > lru_lock is the
> required lock ordering, since there are strong reasons (in dirty
> writeback) for move_lock > i_pages lock.
> 
>> Please refill them for what I overlooked.
> 
> Will do, but not before reviewing your remaining patches.

IIRC, all of the comments have been accepted and pushed to
https://github.com/alexshi/linux.git lruv19

If you don't mind, could you change everything and send out a new version
for further review?

> 
>> Thanks!
>> 
>>>>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>>>>> Is that correct? Or Wei Yang suggested some part of it perhaps?
>>>> 
>>>> Yes, we talked a lot to confirm the locking change is safe.
>>> 
>>> Okay, but the patch was written by you, and sent by you to Andrew:
>>> that is not a case for "Signed-off-by: Someone Else".
>>> 
>> 
>> Ok. let's remove his signed-off.
>> 
>>>>> [PATCH v18 27/32] mm/swap.c: optimizing __pagevec_lru_add lru_lock
>>>>> Could we please drop this one for the moment? And come back to it later
>>>>> when the basic series is safely in. It's a good idea to try sorting
>>>>> together those pages which come under the same lock (though my guess is
>>>>> that they naturally gather themselves together quite well already); but
>>>>> I'm not happy adding 360 bytes to the kernel stack here (and that in
>>>>> addition to 192 bytes of horrid pseudo-vma in the shmem swapin case),
>>>>> though that could be avoided by making it per-cpu. But I hope there's
>>>>> a simpler way of doing it, as efficient, but also useful for the other
>>>>> pagevec operations here: perhaps scanning the pagevec for same page->
>>>>> mem_cgroup (and flags node bits), NULLing entries as they are done.
>>>>> Another, easily fixed, minor defect in this patch: if I'm reading it
>>>>> right, it reverses the order in which the pages are put on the lru?
>>>> 
>>>> this patch could give about 10+% performance gain on my multiple memcg
>>>> readtwice testing. fairness locking cost the performance much.
>>> 
>>> Good to know, should have been mentioned. s/fairness/Repeated/
>>> 
>>> But what was the gain or loss on your multiple memcg readtwice
>>> testing without this patch, compared against node-only lru_lock?
>>> The 80% gain mentioned before, I presume. So this further
>>> optimization can wait until the rest is solid.
>> 
>> the gain based on the patch 26.
> > If I understand your brief comment there, you're saying that > in a fixed interval of time, the baseline 5.9-rc did 100 runs, > the patches up to and including 26/32 did 180 runs, then with > 27/32 on top, did 198 runs? Uh, I updated the testing with some new results here: https://lkml.org/lkml/2020/8/26/212 > > That's a good improvement by 27/32, but not essential for getting > the patchset in: I don't think 27/32 is the right way to do it, > so I'd still prefer to hold it back from the "initial offering". I am ok to hold it back. > >> >>> >>>> >>>> I also tried per cpu solution but that cause much trouble of per cpu func >>>> things, and looks no benefit except a bit struct size of stack, so if >>>> stack size still fine. May we could use the solution and improve it better. >>>> like, functionlize, fix the reverse issue etc. >>> >>> I don't know how important the stack depth consideration is nowadays: >>> I still care, maybe others don't, since VMAP_STACK became an option. >>> >>> Yes, please fix the reversal (if I was right on that); and I expect >>> you could use a singly linked list instead of the double. >> >> single linked list is more saving, but do we have to reverse walking to seek >> the head or tail for correct sequence? > > I imagine all you need is to start off with a > for (i = pagevec_count(pvec) - 1; i >= 0; i--) a nice simple solution, thanks! Thanks alex > loop. > >> >>> >>> But I'll look for an alternative - later, once the urgent stuff >>> is completed - and leave the acks on this patch to others. >> >> Ok, looking forward for your new solution! >> >> Thanks >> Alex
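For context, the shape that 27/32 optimizes (and that would remain if 27/32 is held back) is a per-page relock which simply reuses the held lock while consecutive pages share a lruvec; a rough sketch, assuming the series' relock_page_lruvec_irqsave()/unlock_page_lruvec_irqrestore() helpers rather than quoting the posted code:

void __pagevec_lru_add(struct pagevec *pvec)
{
	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/* cheap when this page's memcg and node match the lruvec already held */
		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
		__pagevec_lru_add_fn(page, lruvec);
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);
	release_pages(pvec->pages, pvec->nr);
	pagevec_reinit(pvec);
}

Sorting (27/32), or the scan-and-NULL idea above, only reduces how often that relock actually has to switch locks when pages from many memcgs are interleaved in one pagevec.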
在 2020/9/12 下午4:38, Hugh Dickins 写道: > On Wed, 9 Sep 2020, Alex Shi wrote: >> 在 2020/9/9 上午7:41, Hugh Dickins 写道: >>> >>> The use of lock_page_memcg() in __munlock_pagevec() in 20/32, >>> introduced in patchset v17, looks good but it isn't: I was lucky that >>> systemd at reboot did some munlocking that exposed the problem to lockdep. >>> The first time into the loop, lock_page_memcg() is done before lru_lock >>> (as 06/32 has allowed); but the second time around the loop, it is done >>> while still holding lru_lock. >> >> I don't know the details of lockdep show. Just wondering could it possible >> to solid the move_lock/lru_lock sequence? >> or try other blocking way which mentioned in commit_charge()? >> >>> >>> lock_page_memcg() really needs to be absorbed into (a variant of) >>> relock_page_lruvec(), and I do have that (it's awkward because of >>> the different ways in which the IRQ flags are handled). And out of >>> curiosity, I've also tried using that in mm/swap.c too, instead of the >>> TestClearPageLRU technique: lockdep is happy, but an update_lru_size() >>> warning showed that it cannot safely be mixed with the TestClearPageLRU >>> technique (that I'd left in isolate_lru_page()). So I'll stash away >>> that relock_page_lruvec(), and consider what's best for mm/mlock.c: >>> now that I've posted these comments so far, that's my priority, then >>> to get the result under testing again, before resuming these comments. >> >> No idea of your solution, but looking forward for your good news! :) > > Yes, it is good news, and simpler than anything suggested above. Awesome! > > The main difficulties will probably be to look good in the 80 columns > (I know that limit has been lifted recently, but some of us use xterms > side by side), and to explain it. > > mm/mlock.c has not been kept up-to-date very well: and in particular, > you have taken too seriously that "Serialize with any parallel > __split_huge_page_refcount()..." comment that you updated to two > comments "Serialize split tail pages in __split_huge_page_tail()...". > > Delete them! The original comment was by Vlastimil for v3.14 in 2014. > But Kirill redesigned THP refcounting for v4.5 in 2016: that's when > __split_huge_page_refcount() went away. And with the new refcounting, > the THP splitting races that lru_lock protected munlock_vma_page() > and __munlock_pagevec() from: those races have become impossible. > > Or maybe there never was such a race in __munlock_pagevec(): you > have added the comment there, assuming lru_lock was for that purpose, > but that was probably just the convenient place to take it, > to cover all the del_page_from_lru()s. > > Observe how split_huge_page_to_list() uses unmap_page() to remove > all pmds and all ptes for the huge page being split, and remap_page() > only replaces the migration entries (used for anon but not for shmem > or file) after doing all of the __split_huge_page_tail()s, before > unlocking any of the pages. Recall that munlock_vma_page() and > __munlock_pagevec() are being applied to pages found mapped > into userspace, by ptes or pmd: there are none of those while > __split_huge_page_tail() is being used, so no race to protect from. > > (Could a newly detached tail be freshly faulted into userspace just > before __split_huge_page() has reached the head? Not quite, the > fault has to wait to get the tail's page lock. But even if it > could, how would that be a problem for __munlock_pagevec()?) 
> 
> There's lots more that could be said: for example, PageMlocked will
> always be clear on the THP head during __split_huge_page_tail(),
> because the last unmap of a PageMlocked page does clear_page_mlock().
> But that's not required to prove the case, it's just another argument
> against the "Serialize" comment you have in __munlock_pagevec().
> 
> So, no need for the problematic lock_page_memcg(page) there in
> __munlock_pagevec(), nor to lock (or relock) lruvec just below it.
> __munlock_pagevec() still needs lru_lock to del_page_from_lru_list(),
> of course, but that must be done after your TestClearPageMlocked has
> stabilized page->memcg. Use relock_page_lruvec_irq() here? I suppose
> that will be easiest, but notice how __munlock_pagevec_fill() has
> already made sure that all the pages in the pagevec are from the same
> zone (and it cannot do the same for memcg without locking page memcg);
> so some of relock's work will be redundant.

It sounds reasonable to me.

> 
> Otherwise, I'm much happier with your mm/mlock.c since looking at it
> in more detail: a couple of nits though - drop the clear_page_mlock()
> hunk from 25/32 - kernel style says do it the way you are undoing by
> - if (!isolate_lru_page(page)) {
> + if (!isolate_lru_page(page))
> putback_lru_page(page);
> - } else {
> + else {
> I don't always follow that over-braced style when making changes,
> but you should not touch otherwise untouched code just to make it
> go against the approved style. And in munlock_vma_page(),
> - if (!TestClearPageMlocked(page)) {
> + if (!TestClearPageMlocked(page))
> /* Potentially, PTE-mapped THP: do not skip the rest PTEs */
> - nr_pages = 1;
> - goto unlock_out;
> - }
> + return 0;
> please restore the braces: with that comment line in there,
> the compiler does not need the braces, but the human eye does.

Yes, it's better to keep the braces there.

Thanks
Alex
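Put together, the munlock_vma_page() check would presumably end up reading as follows (reconstructed from the hunk quoted above, not taken from the v19 branch):

	if (!TestClearPageMlocked(page)) {
		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
		return 0;
	}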
On Sun, 13 Sep 2020, Alex Shi wrote: > > IIRC, all of comments are accepted and push to > https://github.com/alexshi/linux.git lruv19 I just had to relax for the weekend, so no progress from me. I'll take a look at your tree tomorrow, er, later today. > If you don't minder, could you change everything and send out a new version > for further review? Sorry, no. Tiresome though it is for both of us, I'll continue to send you comments, and leave all the posting to you. > Uh, I updated the testing with some new results here: > https://lkml.org/lkml/2020/8/26/212 Right, I missed that, that's better, thanks. Any other test results? Hugh
On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote: > On Sun, 13 Sep 2020, Alex Shi wrote: > > Uh, I updated the testing with some new results here: > > https://lkml.org/lkml/2020/8/26/212 > > Right, I missed that, that's better, thanks. Any other test results? Alex, you were doing some will-it-scale runs earlier. Are you planning to do more of those? Otherwise I can add them in. This is what I have so far. sysbench oltp read-only ----------------------- The goal was to run a real world benchmark, at least more so than something like vm-scalability, with the memory controller enabled but unused to check for regressions. I chose sysbench because it was relatively straightforward to run, but I'm open to ideas for other high level benchmarks that might be more sensitive to this series. CoeffVar shows the test was pretty noisy overall. It's nice to see there's no significant difference between the kernels for low thread counts (1-12), but I'm not sure what to make of the 18 and 20 thread cases. At 20 threads, the CPUs of the node that the test was confined to were saturated and the variance is especially high. I'm tempted to write the 18 and 20 thread cases off as noise. - 2-socket * 10-core * 2-hyperthread broadwell server - test bound to node 1 to lower variance - 251G memory, divided evenly between the nodes (memory size of test shrunk to accommodate confining to one node) - 12 iterations per thread count per kernel - THP enabled export OLTP_CACHESIZE=$(($MEMTOTAL_BYTES/4)) export OLTP_SHAREDBUFFERS=$((MEMTOTAL_BYTES/8)) export OLTP_PAGESIZES="default" export SYSBENCH_DRIVER=postgres export SYSBENCH_MAX_TRANSACTIONS=auto export SYSBENCH_READONLY=yes export SYSBENCH_MAX_THREADS=$((NUMCPUS / 2)) export SYSBENCH_ITERATIONS=12 export SYSBENCH_WORKLOAD_SIZE=$((MEMTOTAL_BYTES*3/8)) export SYSBENCH_CACHE_COLD=no export DATABASE_INIT_ONCE=yes export MMTESTS_NUMA_POLICY=fullbind_single_instance_node numactl --cpunodebind=1 --membind=1 <mmtests_cmdline> sysbench Transactions per second 5.9-rc2 5.9-rc2-lru-v18 Min 1 593.23 ( 0.00%) 583.37 ( -1.66%) Min 4 1897.34 ( 0.00%) 1871.77 ( -1.35%) Min 7 2471.14 ( 0.00%) 2449.77 ( -0.86%) Min 12 2680.00 ( 0.00%) 2853.25 ( 6.46%) Min 18 2183.82 ( 0.00%) 1191.43 ( -45.44%) Min 20 924.96 ( 0.00%) 526.66 ( -43.06%) Hmean 1 912.08 ( 0.00%) 904.24 ( -0.86%) Hmean 4 2057.11 ( 0.00%) 2044.69 ( -0.60%) Hmean 7 2817.59 ( 0.00%) 2812.80 ( -0.17%) Hmean 12 3201.05 ( 0.00%) 3171.09 ( -0.94%) Hmean 18 2529.10 ( 0.00%) 2009.99 * -20.53%* Hmean 20 1742.29 ( 0.00%) 1127.77 * -35.27%* Stddev 1 219.21 ( 0.00%) 220.92 ( -0.78%) Stddev 4 94.94 ( 0.00%) 84.34 ( 11.17%) Stddev 7 189.42 ( 0.00%) 167.58 ( 11.53%) Stddev 12 372.13 ( 0.00%) 199.40 ( 46.42%) Stddev 18 248.42 ( 0.00%) 574.66 (-131.32%) Stddev 20 757.69 ( 0.00%) 666.87 ( 11.99%) CoeffVar 1 22.54 ( 0.00%) 22.86 ( -1.42%) CoeffVar 4 4.61 ( 0.00%) 4.12 ( 10.60%) CoeffVar 7 6.69 ( 0.00%) 5.94 ( 11.30%) CoeffVar 12 11.49 ( 0.00%) 6.27 ( 45.46%) CoeffVar 18 9.74 ( 0.00%) 26.22 (-169.23%) CoeffVar 20 36.32 ( 0.00%) 47.18 ( -29.89%) Max 1 1117.45 ( 0.00%) 1107.33 ( -0.91%) Max 4 2184.92 ( 0.00%) 2136.65 ( -2.21%) Max 7 3086.81 ( 0.00%) 3049.52 ( -1.21%) Max 12 4020.07 ( 0.00%) 3580.95 ( -10.92%) Max 18 3032.30 ( 0.00%) 2810.85 ( -7.30%) Max 20 2891.27 ( 0.00%) 2675.80 ( -7.45%) BHmean-50 1 1098.77 ( 0.00%) 1093.58 ( -0.47%) BHmean-50 4 2139.76 ( 0.00%) 2107.13 ( -1.52%) BHmean-50 7 2972.18 ( 0.00%) 2953.94 ( -0.61%) BHmean-50 12 3494.73 ( 0.00%) 3311.33 ( -5.25%) BHmean-50 18 2729.70 ( 0.00%) 2606.32 ( -4.52%) BHmean-50 
20 2668.72 ( 0.00%) 1779.87 ( -33.31%) BHmean-95 1 958.94 ( 0.00%) 951.84 ( -0.74%) BHmean-95 4 2072.98 ( 0.00%) 2062.01 ( -0.53%) BHmean-95 7 2853.96 ( 0.00%) 2851.21 ( -0.10%) BHmean-95 12 3258.65 ( 0.00%) 3203.53 ( -1.69%) BHmean-95 18 2565.99 ( 0.00%) 2143.90 ( -16.45%) BHmean-95 20 1894.47 ( 0.00%) 1258.34 ( -33.58%) BHmean-99 1 958.94 ( 0.00%) 951.84 ( -0.74%) BHmean-99 4 2072.98 ( 0.00%) 2062.01 ( -0.53%) BHmean-99 7 2853.96 ( 0.00%) 2851.21 ( -0.10%) BHmean-99 12 3258.65 ( 0.00%) 3203.53 ( -1.69%) BHmean-99 18 2565.99 ( 0.00%) 2143.90 ( -16.45%) BHmean-99 20 1894.47 ( 0.00%) 1258.34 ( -33.58%) sysbench Time 5.9-rc2 5.9-rc2-lru Min 1 8.96 ( 0.00%) 9.04 ( -0.89%) Min 4 4.63 ( 0.00%) 4.74 ( -2.38%) Min 7 3.34 ( 0.00%) 3.38 ( -1.20%) Min 12 2.65 ( 0.00%) 2.95 ( -11.32%) Min 18 3.54 ( 0.00%) 3.80 ( -7.34%) Min 20 3.74 ( 0.00%) 4.02 ( -7.49%) Amean 1 11.00 ( 0.00%) 11.11 ( -0.98%) Amean 4 4.92 ( 0.00%) 4.95 ( -0.59%) Amean 7 3.65 ( 0.00%) 3.65 ( -0.16%) Amean 12 3.29 ( 0.00%) 3.32 ( -0.89%) Amean 18 4.20 ( 0.00%) 5.22 * -24.39%* Amean 20 6.02 ( 0.00%) 9.14 * -51.98%* Stddev 1 3.33 ( 0.00%) 3.45 ( -3.40%) Stddev 4 0.23 ( 0.00%) 0.21 ( 7.89%) Stddev 7 0.25 ( 0.00%) 0.22 ( 9.87%) Stddev 12 0.35 ( 0.00%) 0.19 ( 45.09%) Stddev 18 0.38 ( 0.00%) 1.75 (-354.74%) Stddev 20 2.93 ( 0.00%) 4.73 ( -61.72%) CoeffVar 1 30.30 ( 0.00%) 31.02 ( -2.40%) CoeffVar 4 4.63 ( 0.00%) 4.24 ( 8.43%) CoeffVar 7 6.77 ( 0.00%) 6.10 ( 10.02%) CoeffVar 12 10.74 ( 0.00%) 5.85 ( 45.57%) CoeffVar 18 9.15 ( 0.00%) 33.45 (-265.58%) CoeffVar 20 48.64 ( 0.00%) 51.75 ( -6.41%) Max 1 17.01 ( 0.00%) 17.36 ( -2.06%) Max 4 5.33 ( 0.00%) 5.40 ( -1.31%) Max 7 4.14 ( 0.00%) 4.18 ( -0.97%) Max 12 3.89 ( 0.00%) 3.67 ( 5.66%) Max 18 4.82 ( 0.00%) 8.64 ( -79.25%) Max 20 11.09 ( 0.00%) 19.26 ( -73.67%) BAmean-50 1 9.12 ( 0.00%) 9.16 ( -0.49%) BAmean-50 4 4.73 ( 0.00%) 4.80 ( -1.55%) BAmean-50 7 3.46 ( 0.00%) 3.48 ( -0.58%) BAmean-50 12 3.02 ( 0.00%) 3.18 ( -5.24%) BAmean-50 18 3.90 ( 0.00%) 4.08 ( -4.52%) BAmean-50 20 4.02 ( 0.00%) 5.90 ( -46.56%) BAmean-95 1 10.45 ( 0.00%) 10.54 ( -0.82%) BAmean-95 4 4.88 ( 0.00%) 4.91 ( -0.52%) BAmean-95 7 3.60 ( 0.00%) 3.60 ( -0.08%) BAmean-95 12 3.23 ( 0.00%) 3.28 ( -1.60%) BAmean-95 18 4.14 ( 0.00%) 4.91 ( -18.58%) BAmean-95 20 5.56 ( 0.00%) 8.22 ( -48.04%) BAmean-99 1 10.45 ( 0.00%) 10.54 ( -0.82%) BAmean-99 4 4.88 ( 0.00%) 4.91 ( -0.52%) BAmean-99 7 3.60 ( 0.00%) 3.60 ( -0.08%) BAmean-99 12 3.23 ( 0.00%) 3.28 ( -1.60%) BAmean-99 18 4.14 ( 0.00%) 4.91 ( -18.58%) BAmean-99 20 5.56 ( 0.00%) 8.22 ( -48.04%) docker-ized readtwice microbenchmark ------------------------------------ This is Alex's modified readtwice case. Needed a few fixes, and I made it into a script. Updated version attached. Same machine, three runs per kernel, 40 containers per test. This is average MB/s over all containers. 5.9-rc2 5.9-rc2-lru ----------- ----------- 220.5 (3.3) 356.9 (0.5) That's a 62% improvement. 
FROM centos:8 MAINTAINER Alexs #WORKDIR /vm-scalability #RUN yum update -y && yum groupinstall "Development Tools" -y && yum clean all && \ #examples https://www.linuxtechi.com/build-docker-container-images-with-dockerfile/ RUN yum install git xfsprogs patch make gcc -y && yum clean all && \ git clone https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/ && \ cd vm-scalability && make usemem COPY readtwice.patch /vm-scalability/ RUN cd vm-scalability && patch -p1 < readtwice.patch #!/usr/bin/env bash # # Originally by Alex Shi <alex.shi@linux.alibaba.com> # Changes from Daniel Jordan <daniel.m.jordan@oracle.com> SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" TAG='lrulock' runtime=300 nr_cont=$(nproc) cd "$SCRIPT_DIR" echo -e "starting $nr_cont containers\n" pids=() sudo docker build -t "$TAG" . nr_running_cont=$(sudo docker ps | sed '1 d' | wc -l) if (( nr_running_cont != 0 )); then echo "error: $nr_running_cont containers already running" exit 1 fi # start some testing containers for ((i=0; i < nr_cont; i++)); do sudo docker run --privileged=true --rm "$TAG" bash -c "sleep infinity" & done nr_running_cont=$(sudo docker ps | sed '1 d' | wc -l) until (( nr_running_cont == nr_cont )); do sleep .5 nr_running_cont=$(sudo docker ps | sed '1 d' | wc -l) done # do testing evn setup for i in `sudo docker ps | sed '1 d' | awk '{print $1}'`; do sudo docker exec --privileged=true -t $i \ bash -c "cd /vm-scalability/; bash ./case-lru-file-readtwice m" & pids+=($!) done wait "${pids[@]}" pids=() # kick testing for i in `sudo docker ps | sed '1 d' | awk '{print $1}'`; do sudo docker exec --privileged=true -t -e runtime=$runtime $i \ bash -c "cd /vm-scalability/; bash ./case-lru-file-readtwice r" & pids+=($!) done wait "${pids[@]}" pids=() # save results ts=$(date +%y-%m-%d_%H:%M:%S) f="$ts/summary.txt" mkdir "$ts" echo "$ts" >> "$f" uname -r >> "$f" for i in `sudo docker ps | sed '1 d' | awk '{print $1}'`; do sudo docker exec $i bash -c 'cat /tmp/vm-scalability-tmp/dd-output-*' &> "$ts/$i.out" & pids+=($!) done wait "${pids[@]}" pids=() grep 'copied' "$ts"/*.out | \ awk 'BEGIN {a=0;} { a+=$10 } END {print NR, a/(NR)}' | \ tee -a "$f" for i in `sudo docker ps | sed '1 d' | awk '{print $1}'`; do sudo docker stop $i &>/dev/null & done wait echo 'test finished' echo diff --git a/case-lru-file-readtwice b/case-lru-file-readtwice index 85533b248634..57cb97d121ae 100755 --- a/case-lru-file-readtwice +++ b/case-lru-file-readtwice @@ -15,23 +15,30 @@ . 
./hw_vars -for i in `seq 1 $nr_task` -do - create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES / nr_task)) - timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1 & - timeout --foreground -s INT ${runtime:-600} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1 & -done +OUT_DIR=$(hostname)-${nr_task}c-$(((mem + (1<<29))>>30))g +TEST_CASES=${@:-$(echo case-*)} + +echo $((1<<30)) > /proc/sys/vm/max_map_count +echo $((1<<20)) > /proc/sys/kernel/threads-max +echo 1 > /proc/sys/vm/overcommit_memory +#echo 3 > /proc/sys/vm/drop_caches + + +i=1 + +if [ "$1" == "m" ];then + mount_tmpfs + create_sparse_root + create_sparse_file $SPARSE_FILE-$i $((ROTATE_BYTES)) + exit +fi + + +if [ "$1" == "r" ];then + (timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-1-$i 2>&1)& + (timeout --foreground -s INT ${runtime:-300} dd bs=4k if=$SPARSE_FILE-$i of=/dev/null > $TMPFS_MNT/dd-output-2-$i 2>&1)& +fi wait sleep 1 -for file in $TMPFS_MNT/dd-output-* -do - [ -s "$file" ] || { - echo "dd output file empty: $file" >&2 - } - cat $file - rm $file -done - -rm `seq -f $SPARSE_FILE-%g 1 $nr_task` diff --git a/hw_vars b/hw_vars index 8731cefb9f57..ceeaa9f17c0b 100755 --- a/hw_vars +++ b/hw_vars @@ -1,4 +1,4 @@ -#!/bin/sh +#!/bin/sh -e if [ -n "$runtime" ]; then USEMEM="$CMD ./usemem --runtime $runtime" @@ -43,7 +43,7 @@ create_loop_devices() modprobe loop 2>/dev/null [ -e "/dev/loop0" ] || modprobe loop 2>/dev/null - for i in $(seq 0 8) + for i in $(seq 0 104) do [ -e "/dev/loop$i" ] && continue mknod /dev/loop$i b 7 $i
在 2020/9/16 上午12:58, Daniel Jordan 写道: > On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote: >> On Sun, 13 Sep 2020, Alex Shi wrote: >>> Uh, I updated the testing with some new results here: >>> https://lkml.org/lkml/2020/8/26/212 >> >> Right, I missed that, that's better, thanks. Any other test results? > > Alex, you were doing some will-it-scale runs earlier. Are you planning to do > more of those? Otherwise I can add them in. > Hi Daniel, I am happy to see your testing result. :) > This is what I have so far. > > > sysbench oltp read-only > ----------------------- > > The goal was to run a real world benchmark, at least more so than something > like vm-scalability, with the memory controller enabled but unused to check for > regressions. > > I chose sysbench because it was relatively straightforward to run, but I'm open > to ideas for other high level benchmarks that might be more sensitive to this > series. > > CoeffVar shows the test was pretty noisy overall. It's nice to see there's no > significant difference between the kernels for low thread counts (1-12), but > I'm not sure what to make of the 18 and 20 thread cases. At 20 threads, the > CPUs of the node that the test was confined to were saturated and the variance > is especially high. I'm tempted to write the 18 and 20 thread cases off as > noise. > > - 2-socket * 10-core * 2-hyperthread broadwell server > - test bound to node 1 to lower variance > - 251G memory, divided evenly between the nodes (memory size of test shrunk to > accommodate confining to one node) > - 12 iterations per thread count per kernel > - THP enabled Thanks a lot for the results! Alex > > export OLTP_CACHESIZE=$(($MEMTOTAL_BYTES/4)) > export OLTP_SHAREDBUFFERS=$((MEMTOTAL_BYTES/8)) > export OLTP_PAGESIZES="default" > export SYSBENCH_DRIVER=postgres > export SYSBENCH_MAX_TRANSACTIONS=auto > export SYSBENCH_READONLY=yes > export SYSBENCH_MAX_THREADS=$((NUMCPUS / 2)) > export SYSBENCH_ITERATIONS=12 > export SYSBENCH_WORKLOAD_SIZE=$((MEMTOTAL_BYTES*3/8)) > export SYSBENCH_CACHE_COLD=no > export DATABASE_INIT_ONCE=yes > > export MMTESTS_NUMA_POLICY=fullbind_single_instance_node > numactl --cpunodebind=1 --membind=1 <mmtests_cmdline> > > sysbench Transactions per second > 5.9-rc2 5.9-rc2-lru-v18 > Min 1 593.23 ( 0.00%) 583.37 ( -1.66%) > Min 4 1897.34 ( 0.00%) 1871.77 ( -1.35%) > Min 7 2471.14 ( 0.00%) 2449.77 ( -0.86%) > Min 12 2680.00 ( 0.00%) 2853.25 ( 6.46%) > Min 18 2183.82 ( 0.00%) 1191.43 ( -45.44%) > Min 20 924.96 ( 0.00%) 526.66 ( -43.06%) > Hmean 1 912.08 ( 0.00%) 904.24 ( -0.86%) > Hmean 4 2057.11 ( 0.00%) 2044.69 ( -0.60%) > Hmean 7 2817.59 ( 0.00%) 2812.80 ( -0.17%) > Hmean 12 3201.05 ( 0.00%) 3171.09 ( -0.94%) > Hmean 18 2529.10 ( 0.00%) 2009.99 * -20.53%* > Hmean 20 1742.29 ( 0.00%) 1127.77 * -35.27%* > Stddev 1 219.21 ( 0.00%) 220.92 ( -0.78%) > Stddev 4 94.94 ( 0.00%) 84.34 ( 11.17%) > Stddev 7 189.42 ( 0.00%) 167.58 ( 11.53%) > Stddev 12 372.13 ( 0.00%) 199.40 ( 46.42%) > Stddev 18 248.42 ( 0.00%) 574.66 (-131.32%) > Stddev 20 757.69 ( 0.00%) 666.87 ( 11.99%) > CoeffVar 1 22.54 ( 0.00%) 22.86 ( -1.42%) > CoeffVar 4 4.61 ( 0.00%) 4.12 ( 10.60%) > CoeffVar 7 6.69 ( 0.00%) 5.94 ( 11.30%) > CoeffVar 12 11.49 ( 0.00%) 6.27 ( 45.46%) > CoeffVar 18 9.74 ( 0.00%) 26.22 (-169.23%) > CoeffVar 20 36.32 ( 0.00%) 47.18 ( -29.89%) > Max 1 1117.45 ( 0.00%) 1107.33 ( -0.91%) > Max 4 2184.92 ( 0.00%) 2136.65 ( -2.21%) > Max 7 3086.81 ( 0.00%) 3049.52 ( -1.21%) > Max 12 4020.07 ( 0.00%) 3580.95 ( -10.92%) > Max 18 3032.30 ( 0.00%) 2810.85 ( -7.30%) > 
Max 20 2891.27 ( 0.00%) 2675.80 ( -7.45%) > BHmean-50 1 1098.77 ( 0.00%) 1093.58 ( -0.47%) > BHmean-50 4 2139.76 ( 0.00%) 2107.13 ( -1.52%) > BHmean-50 7 2972.18 ( 0.00%) 2953.94 ( -0.61%) > BHmean-50 12 3494.73 ( 0.00%) 3311.33 ( -5.25%) > BHmean-50 18 2729.70 ( 0.00%) 2606.32 ( -4.52%) > BHmean-50 20 2668.72 ( 0.00%) 1779.87 ( -33.31%) > BHmean-95 1 958.94 ( 0.00%) 951.84 ( -0.74%) > BHmean-95 4 2072.98 ( 0.00%) 2062.01 ( -0.53%) > BHmean-95 7 2853.96 ( 0.00%) 2851.21 ( -0.10%) > BHmean-95 12 3258.65 ( 0.00%) 3203.53 ( -1.69%) > BHmean-95 18 2565.99 ( 0.00%) 2143.90 ( -16.45%) > BHmean-95 20 1894.47 ( 0.00%) 1258.34 ( -33.58%) > BHmean-99 1 958.94 ( 0.00%) 951.84 ( -0.74%) > BHmean-99 4 2072.98 ( 0.00%) 2062.01 ( -0.53%) > BHmean-99 7 2853.96 ( 0.00%) 2851.21 ( -0.10%) > BHmean-99 12 3258.65 ( 0.00%) 3203.53 ( -1.69%) > BHmean-99 18 2565.99 ( 0.00%) 2143.90 ( -16.45%) > BHmean-99 20 1894.47 ( 0.00%) 1258.34 ( -33.58%) > > sysbench Time > 5.9-rc2 5.9-rc2-lru > Min 1 8.96 ( 0.00%) 9.04 ( -0.89%) > Min 4 4.63 ( 0.00%) 4.74 ( -2.38%) > Min 7 3.34 ( 0.00%) 3.38 ( -1.20%) > Min 12 2.65 ( 0.00%) 2.95 ( -11.32%) > Min 18 3.54 ( 0.00%) 3.80 ( -7.34%) > Min 20 3.74 ( 0.00%) 4.02 ( -7.49%) > Amean 1 11.00 ( 0.00%) 11.11 ( -0.98%) > Amean 4 4.92 ( 0.00%) 4.95 ( -0.59%) > Amean 7 3.65 ( 0.00%) 3.65 ( -0.16%) > Amean 12 3.29 ( 0.00%) 3.32 ( -0.89%) > Amean 18 4.20 ( 0.00%) 5.22 * -24.39%* > Amean 20 6.02 ( 0.00%) 9.14 * -51.98%* > Stddev 1 3.33 ( 0.00%) 3.45 ( -3.40%) > Stddev 4 0.23 ( 0.00%) 0.21 ( 7.89%) > Stddev 7 0.25 ( 0.00%) 0.22 ( 9.87%) > Stddev 12 0.35 ( 0.00%) 0.19 ( 45.09%) > Stddev 18 0.38 ( 0.00%) 1.75 (-354.74%) > Stddev 20 2.93 ( 0.00%) 4.73 ( -61.72%) > CoeffVar 1 30.30 ( 0.00%) 31.02 ( -2.40%) > CoeffVar 4 4.63 ( 0.00%) 4.24 ( 8.43%) > CoeffVar 7 6.77 ( 0.00%) 6.10 ( 10.02%) > CoeffVar 12 10.74 ( 0.00%) 5.85 ( 45.57%) > CoeffVar 18 9.15 ( 0.00%) 33.45 (-265.58%) > CoeffVar 20 48.64 ( 0.00%) 51.75 ( -6.41%) > Max 1 17.01 ( 0.00%) 17.36 ( -2.06%) > Max 4 5.33 ( 0.00%) 5.40 ( -1.31%) > Max 7 4.14 ( 0.00%) 4.18 ( -0.97%) > Max 12 3.89 ( 0.00%) 3.67 ( 5.66%) > Max 18 4.82 ( 0.00%) 8.64 ( -79.25%) > Max 20 11.09 ( 0.00%) 19.26 ( -73.67%) > BAmean-50 1 9.12 ( 0.00%) 9.16 ( -0.49%) > BAmean-50 4 4.73 ( 0.00%) 4.80 ( -1.55%) > BAmean-50 7 3.46 ( 0.00%) 3.48 ( -0.58%) > BAmean-50 12 3.02 ( 0.00%) 3.18 ( -5.24%) > BAmean-50 18 3.90 ( 0.00%) 4.08 ( -4.52%) > BAmean-50 20 4.02 ( 0.00%) 5.90 ( -46.56%) > BAmean-95 1 10.45 ( 0.00%) 10.54 ( -0.82%) > BAmean-95 4 4.88 ( 0.00%) 4.91 ( -0.52%) > BAmean-95 7 3.60 ( 0.00%) 3.60 ( -0.08%) > BAmean-95 12 3.23 ( 0.00%) 3.28 ( -1.60%) > BAmean-95 18 4.14 ( 0.00%) 4.91 ( -18.58%) > BAmean-95 20 5.56 ( 0.00%) 8.22 ( -48.04%) > BAmean-99 1 10.45 ( 0.00%) 10.54 ( -0.82%) > BAmean-99 4 4.88 ( 0.00%) 4.91 ( -0.52%) > BAmean-99 7 3.60 ( 0.00%) 3.60 ( -0.08%) > BAmean-99 12 3.23 ( 0.00%) 3.28 ( -1.60%) > BAmean-99 18 4.14 ( 0.00%) 4.91 ( -18.58%) > BAmean-99 20 5.56 ( 0.00%) 8.22 ( -48.04%) > > > docker-ized readtwice microbenchmark > ------------------------------------ > > This is Alex's modified readtwice case. Needed a few fixes, and I made it into > a script. Updated version attached. > > Same machine, three runs per kernel, 40 containers per test. This is average > MB/s over all containers. > > 5.9-rc2 5.9-rc2-lru > ----------- ----------- > 220.5 (3.3) 356.9 (0.5) > > That's a 62% improvement. >
在 2020/9/16 上午12:58, Daniel Jordan 写道:
> On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote:
>> On Sun, 13 Sep 2020, Alex Shi wrote:
>>> Uh, I updated the testing with some new results here:
>>> https://lkml.org/lkml/2020/8/26/212
>> Right, I missed that, that's better, thanks. Any other test results?
> Alex, you were doing some will-it-scale runs earlier. Are you planning to do
> more of those? Otherwise I can add them in.

Hi Daniel,

Does compaction perf scale, like thpscale? I expect they could get some
benefit.

Thanks
Alex
On Thu, Sep 17, 2020 at 10:37:45AM +0800, Alex Shi wrote: > 在 2020/9/16 上午12:58, Daniel Jordan 写道: > > On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote: > >> On Sun, 13 Sep 2020, Alex Shi wrote: > >>> Uh, I updated the testing with some new results here: > >>> https://lkml.org/lkml/2020/8/26/212 > >> Right, I missed that, that's better, thanks. Any other test results? > > Alex, you were doing some will-it-scale runs earlier. Are you planning to do > > more of those? Otherwise I can add them in. > > Hi Daniel, > > Does compaction perf scalable, like thpscale, I except they could get some benefit. Yep, I plan to stress compaction. Reclaim as well. I should have said which Alex I meant. I was asking Alex Duyck since he'd done some will-it-scale runs.
On Thu, Sep 17, 2020 at 7:26 AM Daniel Jordan <daniel.m.jordan@oracle.com> wrote: > > On Thu, Sep 17, 2020 at 10:37:45AM +0800, Alex Shi wrote: > > 在 2020/9/16 上午12:58, Daniel Jordan 写道: > > > On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote: > > >> On Sun, 13 Sep 2020, Alex Shi wrote: > > >>> Uh, I updated the testing with some new results here: > > >>> https://lkml.org/lkml/2020/8/26/212 > > >> Right, I missed that, that's better, thanks. Any other test results? > > > Alex, you were doing some will-it-scale runs earlier. Are you planning to do > > > more of those? Otherwise I can add them in. > > > > Hi Daniel, > > > > Does compaction perf scalable, like thpscale, I except they could get some benefit. > > Yep, I plan to stress compaction. Reclaim as well. > > I should have said which Alex I meant. I was asking Alex Duyck since he'd done > some will-it-scale runs. I probably won't be able to do any will-it-scale runs any time soon. If I recall I ran them for this latest v18 patch set and didn't see any regressions like I did with the previous set. However the system I was using is tied up for other purposes and it may be awhile before I can free it up to look into this again. Thanks. - Alex
On Thu, Sep 17, 2020 at 08:39:34AM -0700, Alexander Duyck wrote:
> On Thu, Sep 17, 2020 at 7:26 AM Daniel Jordan
> <daniel.m.jordan@oracle.com> wrote:
> >
> > On Thu, Sep 17, 2020 at 10:37:45AM +0800, Alex Shi wrote:
> > > 在 2020/9/16 上午12:58, Daniel Jordan 写道:
> > > > On Tue, Sep 15, 2020 at 01:21:56AM -0700, Hugh Dickins wrote:
> > > >> On Sun, 13 Sep 2020, Alex Shi wrote:
> > > >>> Uh, I updated the testing with some new results here:
> > > >>> https://lkml.org/lkml/2020/8/26/212
> > > >> Right, I missed that, that's better, thanks. Any other test results?
> > > > Alex, you were doing some will-it-scale runs earlier. Are you planning to do
> > > > more of those? Otherwise I can add them in.
> > >
> > > Hi Daniel,
> > >
> > > Does compaction perf scalable, like thpscale, I except they could get some benefit.
> >
> > Yep, I plan to stress compaction. Reclaim as well.
> >
> > I should have said which Alex I meant. I was asking Alex Duyck since he'd done
> > some will-it-scale runs.
>
> I probably won't be able to do any will-it-scale runs any time soon.
> If I recall I ran them for this latest v18 patch set and didn't see
> any regressions like I did with the previous set. However the system I
> was using is tied up for other purposes and it may be awhile before I
> can free it up to look into this again.

Ok, sure. I hadn't seen the regressions were taken care of, that's good to
hear. Might still add them to my testing for v19 and beyond, we'll see.