Message ID: 1592555636-115095-1-git-send-email-alex.shi@linux.alibaba.com
Series: per memcg lru lock
On Fri, 19 Jun 2020 16:33:38 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:

> This is a new version based on linux-next, which merges many suggestions
> from Hugh Dickins, from the compaction fix to fewer TestClearPageLRU uses,
> reversed comments, etc. Thanks a lot, Hugh!
>
> Johannes Weiner has suggested:
> "So here is a crazy idea that may be worth exploring:
>
> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
> linked list.
>
> Can we make PageLRU atomic and use it to stabilize the lru_lock
> instead, and then use the lru_lock only to serialize list operations?

I don't understand this sentence. How can a per-page flag stabilize a
per-pgdat spinlock? Perhaps some additional description will help.

> ..."
>
> With the new memcg charge path and this solution, we can isolate LRU
> pages for exclusive access in the compaction, page migration, reclaim,
> memcg move_account, huge page split, etc. scenarios, while keeping each
> page's memcg stable. It then becomes possible to change per-node lru
> locking to per-memcg lru locking. As for the pagevec_lru_move_fn
> functions, it is safe to let pages remain on the lru list; the lru lock
> guards them for list integrity.
>
> The patchset consists of 3 parts:
> 1. some code cleanup and minimal optimization, as preparation
> 2. use TestClearPageLRU as the precondition for page isolation
> 3. replace the per-node lru_lock with a per-memcg, per-node lru_lock
>
> The 3rd part moves the per-node lru_lock into the lruvec, thus providing
> an lru_lock for each memcg on each node. So on a large machine, memcgs
> no longer have to suffer from per-node pgdat->lru_lock contention; each
> can go fast with its own lru_lock.
>
> Following Daniel Jordan's suggestion, I have run 208 'dd' tasks in 104
> containers on a 2-socket * 26-core * HT box with a modified case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> With this patchset, the readtwice performance increased by about 80%
> in concurrent containers.
>
> Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up
> this idea 8 years ago, and to the others who gave comments as well:
> Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox, etc.
>
> Thanks for the testing support from Intel 0day and from Rong Chen,
> Fengguang Wu, and Yun Wang. Hugh Dickins also shared his kbuild-swap
> case. Thanks!
>
> ...
>
> 24 files changed, 500 insertions(+), 357 deletions(-)

It's a large patchset, and afaict the whole point is performance gain.
80% in one specialized test sounds nice, but is there a plan for more
extensive quantification?

There isn't much sign of completed review activity here, so I'll go
into hiding for a while.
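To make the quoted idea concrete before the reply below: this is a minimal
sketch of the two-step isolation pattern an atomic PageLRU enables, written
against the end state of part 3 where the lock lives in the lruvec.
TestClearPageLRU(), get_page(), mem_cgroup_page_lruvec(), page_lru() and
del_page_from_lru_list() are existing kernel primitives of that era; the
lruvec->lru_lock field and the overall shape are assumptions about the
series, not its actual code.

#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/swap.h>

/*
 * Sketch only -- not the patchset's code.  How a per-page flag can
 * "stabilize" a per-memcg lock: the flag pins which lock applies.
 * Racing memcg charge/move paths also require the LRU bit before
 * re-parenting a page, so whoever wins TestClearPageLRU() knows the
 * page's lruvec -- and hence the lru_lock inside it -- cannot change
 * until the page is put back.
 */
static bool isolate_page_sketch(struct page *page)
{
	struct lruvec *lruvec;

	if (!TestClearPageLRU(page))
		return false;	/* lost the race, or page not on an LRU */

	get_page(page);
	/* Stable now: the cleared LRU bit pins the page's memcg. */
	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));

	spin_lock_irq(&lruvec->lru_lock);	/* assumed per-memcg lock (part 3) */
	del_page_from_lru_list(page, lruvec, page_lru(page));
	spin_unlock_irq(&lruvec->lru_lock);

	return true;
}

Without the atomic claim, the lruvec looked up before spin_lock_irq() could
belong to a memcg the page has already left, so the wrong lock would be
taken; that is the race the reply below walks through.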
On 2020/6/21 7:08 AM, Andrew Morton wrote:
> On Fri, 19 Jun 2020 16:33:38 +0800 Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>> This is a new version based on linux-next, which merges many suggestions
>> from Hugh Dickins, from the compaction fix to fewer TestClearPageLRU uses,
>> reversed comments, etc. Thanks a lot, Hugh!
>>
>> Johannes Weiner has suggested:
>> "So here is a crazy idea that may be worth exploring:
>>
>> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
>> linked list.
>>
>> Can we make PageLRU atomic and use it to stabilize the lru_lock
>> instead, and then use the lru_lock only to serialize list operations?
>
> I don't understand this sentence. How can a per-page flag stabilize a
> per-pgdat spinlock? Perhaps some additional description will help.

Hi Andrew,

Well, the comment above is missing some context: the lru_lock it refers
to is the new lru_lock in each memcg, not the current per-node lru_lock.
Sorry!

Currently the lru bit is changed under the lru_lock, so isolating a page
from the lru only requires taking the lru_lock. The new patches change
the bit with an atomic operation separate from the lru_lock, so isolating
a page needs both actions: TestClearPageLRU and taking the lru_lock, as
in isolate_lru_page() below.

The main reason for this comes from isolate_migratepages_block() in
compaction.c: we have to take the lru bit before the lru lock, which
serializes page isolation against the memcg page charge/migration paths
that change a page's lruvec, and with it the new lru_lock inside it. The
current isolation just takes the lru lock directly, which fails to guard
against the page's lruvec change (memcg change).

Changes in isolate_lru_page():

-	if (PageLRU(page)) {
+	if (TestClearPageLRU(page)) {
 		pg_data_t *pgdat = page_pgdat(page);
 		struct lruvec *lruvec;
+		int lru = page_lru(page);

-		spin_lock_irq(&pgdat->lru_lock);
+		get_page(page);
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
-		if (PageLRU(page)) {
-			int lru = page_lru(page);
-			get_page(page);
-			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, lru);
-			ret = 0;
-		}
+		spin_lock_irq(&pgdat->lru_lock);
+		del_page_from_lru_list(page, lruvec, lru);
 		spin_unlock_irq(&pgdat->lru_lock);
+		ret = 0;
 	}

>
>> Following Daniel Jordan's suggestion, I have run 208 'dd' tasks in 104
>> containers on a 2-socket * 26-core * HT box with a modified case:
>> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>>
>> With this patchset, the readtwice performance increased by about 80%
>> in concurrent containers.
>>
>> Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up
>> this idea 8 years ago, and to the others who gave comments as well:
>> Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox, etc.
>>
>> Thanks for the testing support from Intel 0day and from Rong Chen,
>> Fengguang Wu, and Yun Wang. Hugh Dickins also shared his kbuild-swap
>> case. Thanks!
>>
>> ...
>>
>> 24 files changed, 500 insertions(+), 357 deletions(-)
>
> It's a large patchset, and afaict the whole point is performance gain.
> 80% in one specialized test sounds nice, but is there a plan for more
> extensive quantification?

In one earlier run I got a 5% aim7 performance gain on a 16-core machine,
and about a 20+% readtwice performance gain; the gain increases a lot with
larger core counts. Are there any suggestions for this?

>
> There isn't much sign of completed review activity here, so I'll go
> into hiding for a while.
>

Yes, it's relatively big, and much of the change also comes from the
review comments. :) Anyway, thanks for looking into it!

Thanks
Alex
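As a closing illustration of where the concurrent-container gain comes
from, here is a hedged sketch of the list-walk side under per-memcg locks.
relock_lruvec() and move_pages_sketch() are hypothetical names invented
for this sketch, not the series' helpers; the point is only that
consecutive pages from the same memcg reuse the already-held lock, while
different containers take disjoint locks instead of all serializing on
one pgdat->lru_lock.

#include <linux/memcontrol.h>
#include <linux/pagevec.h>

/*
 * Sketch only -- batching a pagevec walk under assumed per-memcg lru
 * locks.  Pages from different memcgs use different locks, so many
 * containers no longer contend on a single per-node lock; runs of
 * pages from the same memcg amortize one lock/unlock pair.
 */
static struct lruvec *relock_lruvec(struct lruvec *locked,
				    struct page *page)
{
	struct lruvec *lruvec;

	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
	if (lruvec != locked) {
		if (locked)
			spin_unlock_irq(&locked->lru_lock);
		spin_lock_irq(&lruvec->lru_lock);	/* assumed field */
	}
	return lruvec;
}

static void move_pages_sketch(struct pagevec *pvec)
{
	struct lruvec *locked = NULL;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		locked = relock_lruvec(locked, pvec->pages[i]);
		/* ... move pvec->pages[i] within its own lruvec's lists ... */
	}

	if (locked)
		spin_unlock_irq(&locked->lru_lock);
}

This also matches the cover letter's note that pagevec_lru_move_fn pages
may stay on the lru list: the pages are not isolated, so only the lock of
the lruvec they currently belong to is needed for list integrity.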