Message ID: 20240425131724.36778-3-shikemeng@huaweicloud.com (mailing list archive)
State:      New
Series:     Fix and cleanups to page-writeback
On Thu 25-04-24 21:17:22, Kemeng Shi wrote:
> The wb_calc_thresh is supposed to calculate wb's share of bg_thresh in
> the global domain. To calculate wb's share of bg_thresh in the cgroup
> domain, it is more reasonable to use __wb_calc_thresh, which is how we
> calculate dirty_thresh in the cgroup domain in balance_dirty_pages().
>
> Consider the following domain hierarchy:
>
>                 global domain (> 20G)
>                 /                    \
>     cgroup domain1 (10G)      cgroup domain2 (10G)
>                 |                    |
>     bdi        wb1                  wb2
>
> Assume wb1 and wb2 have the same bandwidth.
> We have global domain bg_thresh > 2G and cgroup domain bg_thresh 1G.
> Then we have:
> wb's thresh in global domain = 2G * (wb bandwidth) / (system bandwidth)
>                              = 2G * 1/2 = 1G
> wb's thresh in cgroup domain = 1G * (wb bandwidth) / (system bandwidth)
>                              = 1G * 1/2 = 0.5G
> In the end, wb1 and wb2 will each be limited at 0.5G, so the system will
> be limited at 1G, which is less than the global domain bg_thresh of 2G.

This was a bit hard to understand for me so I'd rephrase it as:

wb_calc_thresh() is calculating wb's share of bg_thresh in the global
domain. However in case of cgroup writeback this is not the right thing to
do. Consider the following domain hierarchy:

                global domain (> 20G)
                /                    \
          cgroup1 (10G)        cgroup2 (10G)
                |                    |
    bdi        wb1                  wb2

and assume wb1 and wb2 have the same bandwidth and the background threshold
is set at 10%. The bg_thresh of cgroup1 and cgroup2 is going to be 1G. Now
because wb_calc_thresh(mdtc->wb, mdtc->bg_thresh) calculates per-wb
threshold in the global domain as (wb bandwidth) / (domain bandwidth) it
returns bg_thresh for wb1 as 0.5G although it has nobody to compete against
in cgroup1.

Fix the problem by calculating wb's share of bg_thresh in the cgroup
domain.

> Test as follows:
>
> /* make it easier to observe the issue */
> echo 300000 > /proc/sys/vm/dirty_expire_centisecs
> echo 100 > /proc/sys/vm/dirty_writeback_centisecs
>
> /* run fio in wb1 */
> cd /sys/fs/cgroup
> echo "+memory +io" > cgroup.subtree_control
> mkdir group1
> cd group1
> echo 10G > memory.high
> echo 10G > memory.max
> echo $$ > cgroup.procs
> mkfs.ext4 -F /dev/vdb
> mount /dev/vdb /bdi1/
> fio -name test -filename=/bdi1/file -size=600M -ioengine=libaio -bs=4K \
>     -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
>
> /* run fio in wb2 with a new shell */
> cd /sys/fs/cgroup
> mkdir group2
> cd group2
> echo 10G > memory.high
> echo 10G > memory.max
> echo $$ > cgroup.procs
> mkfs.ext4 -F /dev/vdc
> mount /dev/vdc /bdi2/
> fio -name test -filename=/bdi2/file -size=600M -ioengine=libaio -bs=4K \
>     -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
>
> Before the fix, the written pages of wb1 and wb2 reported by
> tools/writeback/wb_monitor.py keep growing. After the fix, only a few
> written pages accumulate.
> There is no obvious change in the fio results.
>
> Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>

Besides the changelog rephrasing the change looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/page-writeback.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 2a3b68aae336..14893b20d38c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2137,7 +2137,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
>  		if (mdtc->dirty > mdtc->bg_thresh)
>  			return true;
>
> -		thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh);
> +		thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh);
>  		if (thresh < 2 * wb_stat_error())
>  			reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
>  		else
> --
> 2.30.0
Hi Jan,

on 5/3/2024 5:30 PM, Jan Kara wrote:
> On Thu 25-04-24 21:17:22, Kemeng Shi wrote:
>> The wb_calc_thresh is supposed to calculate wb's share of bg_thresh in
>> the global domain. To calculate wb's share of bg_thresh in the cgroup
>> domain, it is more reasonable to use __wb_calc_thresh, which is how we
>> calculate dirty_thresh in the cgroup domain in balance_dirty_pages().
>>
>> [...]
>
> This was a bit hard to understand for me so I'd rephrase it as:
>
> wb_calc_thresh() is calculating wb's share of bg_thresh in the global
> domain. However in case of cgroup writeback this is not the right thing to
> do. Consider the following domain hierarchy:
>
>                 global domain (> 20G)
>                 /                    \
>           cgroup1 (10G)        cgroup2 (10G)
>                 |                    |
>     bdi        wb1                  wb2
>
> and assume wb1 and wb2 have the same bandwidth and the background threshold
> is set at 10%. The bg_thresh of cgroup1 and cgroup2 is going to be 1G. Now
> because wb_calc_thresh(mdtc->wb, mdtc->bg_thresh) calculates per-wb
> threshold in the global domain as (wb bandwidth) / (domain bandwidth) it
> returns bg_thresh for wb1 as 0.5G although it has nobody to compete against
> in cgroup1.
>
> Fix the problem by calculating wb's share of bg_thresh in the cgroup
> domain.

Thanks for improving the changelog. As this was merged into the -mm and
mm-unstable trees, I'm not sure if a new patch is needed. If there is
anything I should do, please let me know. Thanks.

> [...]
>
> Besides the changelog rephrasing the change looks good. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
> 								Honza
>
> [...]
On Tue 07-05-24 09:16:39, Kemeng Shi wrote:
> Hi Jan,
>
> on 5/3/2024 5:30 PM, Jan Kara wrote:
>> [...]
>>
>> Fix the problem by calculating wb's share of bg_thresh in the cgroup
>> domain.
>
> Thanks for improving the changelog. As this was merged into the -mm and
> mm-unstable trees, I'm not sure if a new patch is needed. If there is
> anything I should do, please let me know. Thanks.

No need to do anything here. Andrew has picked up these updates.

								Honza
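For readers following the arithmetic in the thread above, below is a minimal
userspace sketch of the share calculation under discussion. It is plain C,
not kernel code: the 1G cgroup bg_thresh and the equal bandwidth split are
simply the numbers from the example, and wb_share() is a local helper for
this sketch, not a kernel API. It shows why splitting the cgroup bg_thresh
by the global-domain bandwidth halves wb1's effective background threshold,
while splitting it by the cgroup's own bandwidth does not.

#include <stdio.h>

/*
 * A wb's share of a domain's threshold is, roughly,
 *   thresh * (wb bandwidth) / (sum of wb bandwidths in that domain).
 * The kernel derives the bandwidth ratio from per-wb writeout-completion
 * fractions rather than raw numbers, but the proportionality is the same.
 */
static double wb_share(double domain_thresh, double wb_bw, double domain_bw)
{
	return domain_thresh * wb_bw / domain_bw;
}

int main(void)
{
	double cgroup_bg_thresh = 1.0;		/* cgroup1 bg_thresh: 1G    */
	double wb1_bw = 1.0, wb2_bw = 1.0;	/* equal bandwidth          */
	double global_bw = wb1_bw + wb2_bw;	/* both wbs compete here    */
	double cgroup1_bw = wb1_bw;		/* wb1 is alone in cgroup1  */

	/* Old behaviour: cgroup bg_thresh split by the global-domain share. */
	printf("global-domain share of cgroup bg_thresh: %.2fG\n",
	       wb_share(cgroup_bg_thresh, wb1_bw, global_bw));	/* 0.50G */

	/* Fixed behaviour: cgroup bg_thresh split by the cgroup-domain share. */
	printf("cgroup-domain share of cgroup bg_thresh: %.2fG\n",
	       wb_share(cgroup_bg_thresh, wb1_bw, cgroup1_bw));	/* 1.00G */

	return 0;
}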
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2a3b68aae336..14893b20d38c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2137,7 +2137,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		if (mdtc->dirty > mdtc->bg_thresh)
 			return true;
 
-		thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh);
+		thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh);
 		if (thresh < 2 * wb_stat_error())
 			reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
 		else
The wb_calc_thresh is supposed to calculate wb's share of bg_thresh in the
global domain. To calculate wb's share of bg_thresh in the cgroup domain,
it is more reasonable to use __wb_calc_thresh, which is how we calculate
dirty_thresh in the cgroup domain in balance_dirty_pages().

Consider the following domain hierarchy:

                global domain (> 20G)
                /                    \
    cgroup domain1 (10G)      cgroup domain2 (10G)
                |                    |
    bdi        wb1                  wb2

Assume wb1 and wb2 have the same bandwidth.
We have global domain bg_thresh > 2G and cgroup domain bg_thresh 1G.
Then we have:
wb's thresh in global domain = 2G * (wb bandwidth) / (system bandwidth)
                             = 2G * 1/2 = 1G
wb's thresh in cgroup domain = 1G * (wb bandwidth) / (system bandwidth)
                             = 1G * 1/2 = 0.5G
In the end, wb1 and wb2 will each be limited at 0.5G, so the system will
be limited at 1G, which is less than the global domain bg_thresh of 2G.

Test as follows:

/* make it easier to observe the issue */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 100 > /proc/sys/vm/dirty_writeback_centisecs

/* run fio in wb1 */
cd /sys/fs/cgroup
echo "+memory +io" > cgroup.subtree_control
mkdir group1
cd group1
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdb
mount /dev/vdb /bdi1/
fio -name test -filename=/bdi1/file -size=600M -ioengine=libaio -bs=4K \
    -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0

/* run fio in wb2 with a new shell */
cd /sys/fs/cgroup
mkdir group2
cd group2
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdc
mount /dev/vdc /bdi2/
fio -name test -filename=/bdi2/file -size=600M -ioengine=libaio -bs=4K \
    -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0

Before the fix, the written pages of wb1 and wb2 reported by
tools/writeback/wb_monitor.py keep growing. After the fix, only a few
written pages accumulate.
There is no obvious change in the fio results.

Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/page-writeback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
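To tie the one-line change back to the behaviour observed in the test, the
sketch below is a simplified, self-contained model of the cgroup-domain
part of the background check. It is ordinary C with helper names local to
this sketch (not kernel APIs), and the values in gigabytes are taken from
the example: roughly 600M of reclaimable pages in wb1, a cgroup bg_thresh
of 1G, and per-wb thresholds of 0.5G (old, global-domain share) versus 1G
(new, cgroup-domain share). With the old 0.5G threshold the check keeps
firing, which is why the written pages kept growing; with the 1G threshold
it does not.

#include <stdbool.h>
#include <stdio.h>

/* Cgroup-domain part of the background-writeback decision, modelled in
 * userspace.  All values are in gigabytes. */
static bool over_bg_thresh(double domain_dirty, double domain_bg_thresh,
			   double wb_reclaimable, double wb_thresh)
{
	if (domain_dirty > domain_bg_thresh)
		return true;			/* whole domain over threshold */
	return wb_reclaimable > wb_thresh;	/* this wb over its own share  */
}

int main(void)
{
	double dirty = 0.6;	/* ~600M dirtied by the fio job in cgroup1 */

	/* Before the fix wb1's share is computed in the global domain: 0.5G. */
	printf("before fix: %s\n",
	       over_bg_thresh(dirty, 1.0, dirty, 0.5) ?
	       "start background writeback" : "below threshold");

	/* After the fix it is computed in the cgroup domain: 1G. */
	printf("after fix:  %s\n",
	       over_bg_thresh(dirty, 1.0, dirty, 1.0) ?
	       "start background writeback" : "below threshold");

	return 0;
}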