From patchwork Thu May 23 14:24:15 2019
X-Patchwork-Submitter: Aaron Lu
X-Patchwork-Id: 10957773
Date: Thu, 23 May 2019 22:24:15 +0800
From: Aaron Lu <aaron.lu@linux.alibaba.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Huang Ying
Subject: [PATCH] mm, swap: use rbtree for swap_extent
Message-ID: <20190523142404.GA181@aaronlu>
User-Agent: Mutt/1.9.4 (2018-02-28)

From: Aaron Lu <aaron.lu@linux.alibaba.com>

swap_extent is used to map a swap page offset to the backing device's
block offset. One swap_extent is used per contiguous block range, and
all these swap_extents are managed in a linked list.

These swap_extents are used by map_swap_entry() in swap's read and
write paths. To find the backing device's block offset for a given page
offset, the swap_extent list is traversed linearly, with
curr_swap_extent used as a cache to speed up the search.
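In essence, the current lookup is the following linear walk (condensed
from the map_swap_entry() code removed by this patch; locking and the
BUG_ON sanity check are omitted):

    /* O(nr_extents) when the curr_swap_extent cache misses. */
    struct swap_extent *se = sis->curr_swap_extent;

    for (;;) {
        if (se->start_page <= offset &&
            offset < se->start_page + se->nr_pages)
            return se->start_block + (offset - se->start_page);
        se = list_next_entry(se, list);
        sis->curr_swap_extent = se;    /* cache for the next lookup */
    }

With the rbtree introduced below, the same lookup becomes
O(log(nr_extents)).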
This works well as long as there are not too many swap_extents or only
a few processes access the swap device, but when the swap device has
many extents and a number of processes access it concurrently, it can
be a problem. On one of our servers, the disk's remaining size is
tight:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
... ...
/dev/nvme0n1p1  1.8T  1.3T  504G  72% /home/t4

When creating an 80G swapfile there, there are as many as 84656 swap
extents. The end result is that the kernel spends about 30% of its time
in map_swap_entry() and swap throughput is only 70MB/s. As a
comparison, when I used a smaller swapfile, like a 4G one whose extent
count dropped to about 2000, swap throughput is back to 400-500MB/s and
map_swap_entry() accounts for about 3%.

One downside of using an rbtree for swap_extent is that 'struct
rb_node' takes 24 bytes while 'struct list_head' takes 16 bytes, i.e. 8
bytes more per swap_extent. For a swapfile that has 80k swap_extents,
that means 625KiB more memory consumed.

Test:

Since it's not possible to reboot that server, I could not test this
patch directly there. Instead, I tested it on another server with an
NVMe disk. I created a 20G swapfile on an NVMe-backed XFS filesystem.
By default, the filesystem is quite clean and the created swapfile has
only 2 extents. Testing vanilla and this patch shows no obvious
performance difference when the swapfile is not fragmented.

To see the patch's effect, I used some tweaks to manually fragment the
swapfile by breaking the extents at 1M boundaries. This made the
swapfile have 20K extents.

nr_task=4
kernel   swapout(KB/s)  map_swap_entry(perf)  swapin(KB/s)    map_swap_entry(perf)
vanilla  165191         90.77%                171798          90.21%
patched  858993 +420%   2.16%                 715827 +317%    0.77%

nr_task=8
kernel   swapout(KB/s)  map_swap_entry(perf)  swapin(KB/s)    map_swap_entry(perf)
vanilla  306783         92.19%                318145          87.76%
patched  954437 +211%   2.35%                 1073741 +237%   1.57%

swapout: the throughput of swap out, in KB/s, higher is better
1st map_swap_entry: cpu cycles percent sampled by perf
swapin: the throughput of swap in, in KB/s, higher is better
2nd map_swap_entry: cpu cycles percent sampled by perf

nr_task=1 doesn't show any difference; this is because curr_swap_extent
can effectively cache the correct swap extent for a single-task
workload.

Signed-off-by: Aaron Lu <aaron.lu@linux.alibaba.com>
---
 include/linux/swap.h |   5 +-
 mm/page_io.c         |   2 +-
 mm/swapfile.c        | 137 +++++++++++++++++++++++--------------------
 3 files changed, 78 insertions(+), 66 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..a229c6273781 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -148,7 +148,7 @@ struct zone;
  * We always assume that blocks are of size PAGE_SIZE.
  */
 struct swap_extent {
-    struct list_head list;
+    struct rb_node rb_node;
     pgoff_t start_page;
     pgoff_t nr_pages;
     sector_t start_block;
@@ -247,8 +247,7 @@ struct swap_info_struct {
     unsigned int cluster_next;  /* likely index for next allocation */
     unsigned int cluster_nr;    /* countdown to next cluster search */
     struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
-    struct swap_extent *curr_swap_extent;
-    struct swap_extent first_swap_extent;
+    struct rb_root swap_extent_root;/* root of the swap extent rbtree */
     struct block_device *bdev;  /* swap device or bdev of swap file */
     struct file *swap_file;     /* seldom referenced */
     unsigned int old_block_size;    /* seldom referenced */
diff --git a/mm/page_io.c b/mm/page_io.c
index 2e8019d0e048..24ad2d43c11d 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -164,7 +164,7 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
     blocks_per_page = PAGE_SIZE >> blkbits;

     /*
-     * Map all the blocks into the extent list.  This code doesn't try
+     * Map all the blocks into the extent tree.  This code doesn't try
      * to be very smart.
      */
     probe_block = 0;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 596ac98051c5..82b96750ad45 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -152,6 +152,18 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
     return ret;
 }

+static inline struct swap_extent *first_se(struct swap_info_struct *sis)
+{
+    struct rb_node *rb = rb_first(&sis->swap_extent_root);
+    return rb_entry(rb, struct swap_extent, rb_node);
+}
+
+static inline struct swap_extent *next_se(struct swap_extent *se)
+{
+    struct rb_node *rb = rb_next(&se->rb_node);
+    return rb ? rb_entry(rb, struct swap_extent, rb_node) : NULL;
+}
+
 /*
  * swapon tell device that all the old swap contents can be discarded,
  * to allow the swap device to optimize its wear-levelling.
@@ -164,7 +176,7 @@ static int discard_swap(struct swap_info_struct *si)
     int err = 0;

     /* Do not discard the swap header page! */
-    se = &si->first_swap_extent;
+    se = first_se(si);
     start_block = (se->start_block + 1) << (PAGE_SHIFT - 9);
     nr_blocks = ((sector_t)se->nr_pages - 1) << (PAGE_SHIFT - 9);
     if (nr_blocks) {
@@ -175,7 +187,7 @@ static int discard_swap(struct swap_info_struct *si)
         cond_resched();
     }

-    list_for_each_entry(se, &si->first_swap_extent.list, list) {
+    for (se = next_se(se); se; se = next_se(se)) {
         start_block = se->start_block << (PAGE_SHIFT - 9);
         nr_blocks = (sector_t)se->nr_pages << (PAGE_SHIFT - 9);

@@ -189,6 +201,26 @@ static int discard_swap(struct swap_info_struct *si)
     return err;     /* That will often be -EOPNOTSUPP */
 }

+static struct swap_extent *
+offset_to_swap_extent(struct swap_info_struct *sis, unsigned long offset)
+{
+    struct swap_extent *se;
+    struct rb_node *rb;
+
+    rb = sis->swap_extent_root.rb_node;
+    while (rb) {
+        se = rb_entry(rb, struct swap_extent, rb_node);
+        if (offset < se->start_page)
+            rb = rb->rb_left;
+        else if (offset >= se->start_page + se->nr_pages)
+            rb = rb->rb_right;
+        else
+            return se;
+    }
+    /* It *must* be present */
+    BUG_ON(1);
+}
+
 /*
  * swap allocation tell device that a cluster of swap can now be discarded,
  * to allow the swap device to optimize its wear-levelling.
@@ -196,32 +228,25 @@ static int discard_swap(struct swap_info_struct *si)
 static void discard_swap_cluster(struct swap_info_struct *si,
                  pgoff_t start_page, pgoff_t nr_pages)
 {
-    struct swap_extent *se = si->curr_swap_extent;
-    int found_extent = 0;
+    struct swap_extent *se = offset_to_swap_extent(si, start_page);

     while (nr_pages) {
-        if (se->start_page <= start_page &&
-            start_page < se->start_page + se->nr_pages) {
-            pgoff_t offset = start_page - se->start_page;
-            sector_t start_block = se->start_block + offset;
-            sector_t nr_blocks = se->nr_pages - offset;
-
-            if (nr_blocks > nr_pages)
-                nr_blocks = nr_pages;
-            start_page += nr_blocks;
-            nr_pages -= nr_blocks;
-
-            if (!found_extent++)
-                si->curr_swap_extent = se;
-
-            start_block <<= PAGE_SHIFT - 9;
-            nr_blocks <<= PAGE_SHIFT - 9;
-            if (blkdev_issue_discard(si->bdev, start_block,
-                    nr_blocks, GFP_NOIO, 0))
-                break;
-        }
+        pgoff_t offset = start_page - se->start_page;
+        sector_t start_block = se->start_block + offset;
+        sector_t nr_blocks = se->nr_pages - offset;
+
+        if (nr_blocks > nr_pages)
+            nr_blocks = nr_pages;
+        start_page += nr_blocks;
+        nr_pages -= nr_blocks;
+
+        start_block <<= PAGE_SHIFT - 9;
+        nr_blocks <<= PAGE_SHIFT - 9;
+        if (blkdev_issue_discard(si->bdev, start_block,
+                nr_blocks, GFP_NOIO, 0))
+            break;

-        se = list_next_entry(se, list);
+        se = next_se(se);
     }
 }

@@ -1684,7 +1709,7 @@ int swap_type_of(dev_t device, sector_t offset, struct block_device **bdev_p)
             return type;
         }
         if (bdev == sis->bdev) {
-            struct swap_extent *se = &sis->first_swap_extent;
+            struct swap_extent *se = first_se(sis);

             if (se->start_block == offset) {
                 if (bdev_p)
@@ -2161,7 +2186,6 @@ static void drain_mmlist(void)
 static sector_t map_swap_entry(swp_entry_t entry, struct block_device **bdev)
 {
     struct swap_info_struct *sis;
-    struct swap_extent *start_se;
     struct swap_extent *se;
     pgoff_t offset;

@@ -2169,18 +2193,8 @@ static sector_t map_swap_entry(swp_entry_t entry, struct block_device **bdev)
     *bdev = sis->bdev;

     offset = swp_offset(entry);
-    start_se = sis->curr_swap_extent;
-    se = start_se;
-
-    for ( ; ; ) {
-        if (se->start_page <= offset &&
-            offset < (se->start_page + se->nr_pages)) {
-            return se->start_block + (offset - se->start_page);
-        }
-        se = list_next_entry(se, list);
-        sis->curr_swap_extent = se;
-        BUG_ON(se == start_se);     /* It *must* be present */
-    }
+    se = offset_to_swap_extent(sis, offset);
+    return se->start_block + (offset - se->start_page);
 }

 /*
@@ -2198,12 +2212,11 @@ sector_t map_swap_page(struct page *page, struct block_device **bdev)
  */
 static void destroy_swap_extents(struct swap_info_struct *sis)
 {
-    while (!list_empty(&sis->first_swap_extent.list)) {
-        struct swap_extent *se;
+    while (!RB_EMPTY_ROOT(&sis->swap_extent_root)) {
+        struct rb_node *rb = sis->swap_extent_root.rb_node;
+        struct swap_extent *se = rb_entry(rb, struct swap_extent, rb_node);

-        se = list_first_entry(&sis->first_swap_extent.list,
-                struct swap_extent, list);
-        list_del(&se->list);
+        rb_erase(rb, &sis->swap_extent_root);
         kfree(se);
     }

@@ -2219,7 +2232,7 @@ static void destroy_swap_extents(struct swap_info_struct *sis)

 /*
  * Add a block range (and the corresponding page range) into this swapdev's
- * extent list.  The extent list is kept sorted in page order.
+ * extent tree.
  *
  * This function rather assumes that it is called in ascending page order.
  */
@@ -2227,20 +2240,21 @@ int
 add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
         unsigned long nr_pages, sector_t start_block)
 {
+    struct rb_node **link = &sis->swap_extent_root.rb_node, *parent = NULL;
     struct swap_extent *se;
     struct swap_extent *new_se;
-    struct list_head *lh;
-
-    if (start_page == 0) {
-        se = &sis->first_swap_extent;
-        sis->curr_swap_extent = se;
-        se->start_page = 0;
-        se->nr_pages = nr_pages;
-        se->start_block = start_block;
-        return 1;
-    } else {
-        lh = sis->first_swap_extent.list.prev;  /* Highest extent */
-        se = list_entry(lh, struct swap_extent, list);
+
+    /*
+     * place the new node at the right most since the
+     * function is called in ascending page order.
+     */
+    while (*link) {
+        parent = *link;
+        link = &parent->rb_right;
+    }
+
+    if (parent) {
+        se = rb_entry(parent, struct swap_extent, rb_node);
         BUG_ON(se->start_page + se->nr_pages != start_page);
         if (se->start_block + se->nr_pages == start_block) {
             /* Merge it */
@@ -2249,9 +2263,7 @@ add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
         }
     }

-    /*
-     * No merge.  Insert a new extent, preserving ordering.
-     */
+    /* No merge, insert a new extent. */
     new_se = kmalloc(sizeof(*se), GFP_KERNEL);
     if (new_se == NULL)
         return -ENOMEM;
@@ -2259,7 +2271,8 @@ add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
     new_se->nr_pages = nr_pages;
     new_se->start_block = start_block;

-    list_add_tail(&new_se->list, &sis->first_swap_extent.list);
+    rb_link_node(&new_se->rb_node, parent, link);
+    rb_insert_color(&new_se->rb_node, &sis->swap_extent_root);
     return 1;
 }
 EXPORT_SYMBOL_GPL(add_swap_extent);
@@ -2749,7 +2762,7 @@ static struct swap_info_struct *alloc_swap_info(void)
          * would be relying on p->type to remain valid.
          */
     }
-    INIT_LIST_HEAD(&p->first_swap_extent.list);
+    p->swap_extent_root = RB_ROOT;
     plist_node_init(&p->list, 0);
     for_each_node(i)
         plist_node_init(&p->avail_lists[i], 0);
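For illustration only (not part of the patch): the tree lookup and the
"append at the rightmost, then merge" insert above can be modeled by
the self-contained userspace sketch below. A plain unbalanced binary
search tree stands in for the kernel's rbtree, and all names here are
made up for the example:

    #include <assert.h>
    #include <stdio.h>

    /* Userspace stand-in for struct swap_extent. */
    struct extent {
        unsigned long start_page;   /* first page offset covered */
        unsigned long nr_pages;     /* number of pages covered */
        unsigned long start_block;  /* device block of start_page */
        struct extent *left, *right;
    };

    /* Same descent as offset_to_swap_extent(): O(height), not O(n). */
    static struct extent *lookup(struct extent *root, unsigned long offset)
    {
        while (root) {
            if (offset < root->start_page)
                root = root->left;
            else if (offset >= root->start_page + root->nr_pages)
                root = root->right;
            else
                return root;
        }
        return NULL;  /* the kernel BUG()s instead: it must be present */
    }

    /*
     * Find the link where a new rightmost node goes, as add_swap_extent()
     * does. The kernel then rebalances via rb_insert_color(); a plain BST
     * omits that, so it degenerates to a list for this sorted input.
     */
    static struct extent **rightmost_link(struct extent **link)
    {
        while (*link)
            link = &(*link)->right;
        return link;
    }

    int main(void)
    {
        struct extent e1 = { 0, 100, 1000, NULL, NULL };
        struct extent e2 = { 100, 50, 5000, NULL, NULL };
        struct extent *root = NULL, *se;

        *rightmost_link(&root) = &e1;  /* extents arrive in page order */
        *rightmost_link(&root) = &e2;

        se = lookup(root, 120);
        assert(se == &e2);
        /* page offset 120 -> block 5000 + (120 - 100) = 5020 */
        printf("block %lu\n", se->start_block + (120 - se->start_page));
        return 0;
    }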