From patchwork Mon Dec  3 10:25:32 2018
From: Nikolay Borisov
To: linux-btrfs@vger.kernel.org
Cc: Nikolay Borisov
Subject: [RFC PATCH] btrfs: Remove __extent_readpages
Date: Mon, 3 Dec 2018 12:25:32 +0200
Message-Id: <20181203102532.13945-1-nborisov@suse.com>
X-Mailer: git-send-email 2.17.1

When extent_readpages is called from the generic readahead code it first
builds a batch of 16 pages (which might or might not be consecutive,
depending on whether add_to_page_cache_lru failed) and submits them to
__extent_readpages. The latter ensures that each range of pages (within
the batch of 16) passed to __do_contiguous_readpages is consecutive. If
add_to_page_cache_lru doesn't fail, then __extent_readpages calls
__do_contiguous_readpages only once, with the whole batch of 16.
Alternatively, if add_to_page_cache_lru fails on, say, the 8th page, then
the contiguous page read code is called twice.

All of this can be simplified by exploiting the fact that all pages passed
to extent_readpages are consecutive: the batch built in that function is
already consecutive (barring add_to_page_cache_lru failures), so it is
ready to be submitted directly to __do_contiguous_readpages. Also simplify
the name of the function to contiguous_readpages.

Signed-off-by: Nikolay Borisov
---

So this patch looks like a very nice cleanup; however, when doing
performance measurements with fio I was shocked to see that it is actually
detrimental to performance.
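To make the splitting described above concrete, here is a minimal userspace
sketch (plain C, not kernel code: page offsets are modeled as raw byte
offsets, PAGE_SIZE is hardcoded to 4096, and submit_contiguous() merely
stands in for __do_contiguous_readpages()):

/*
 * Standalone illustration of how __extent_readpages splits a batch of
 * page offsets into contiguous runs. A gap in the batch (e.g. because
 * add_to_page_cache_lru failed for one page) produces two separate
 * "contiguous read" submissions.
 */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Stands in for __do_contiguous_readpages(). */
static void submit_contiguous(uint64_t start, uint64_t end, int nr)
{
	printf("submit %d pages: [%llu, %llu]\n", nr,
	       (unsigned long long)start, (unsigned long long)end);
}

static void split_into_runs(const uint64_t *offsets, int nr_pages)
{
	uint64_t start = 0, end = 0;
	int index, first_index = 0;

	for (index = 0; index < nr_pages; index++) {
		uint64_t page_start = offsets[index];

		if (!end) {
			/* First page of a new run. */
			start = page_start;
			end = start + PAGE_SIZE - 1;
			first_index = index;
		} else if (end + 1 == page_start) {
			/* This page extends the current run. */
			end += PAGE_SIZE;
		} else {
			/* Gap: flush the current run, start a new one. */
			submit_contiguous(start, end, index - first_index);
			start = page_start;
			end = start + PAGE_SIZE - 1;
			first_index = index;
		}
	}

	if (end)
		submit_contiguous(start, end, index - first_index);
}

int main(void)
{
	uint64_t offsets[15];
	int i, n = 0;

	/* 16 consecutive pages, with the 8th one missing. */
	for (i = 0; i < 16; i++) {
		if (i == 7)	/* pretend add_to_page_cache_lru failed here */
			continue;
		offsets[n++] = (uint64_t)i * PAGE_SIZE;
	}

	split_into_runs(offsets, n);	/* prints a 7-page and an 8-page run */
	return 0;
}

Running it prints two submissions (7 pages, then 8 pages), i.e. the "called
twice" case from the commit message.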
Here are the results. The command line used for fio:

fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync \
    --bs=4k --numjobs=1 --size=1G --runtime=600 --group_reporting --loop 10

This was tested on a VM with 4G of RAM, so the test file is smaller than
memory and its pages should have been nicely read ahead.

PATCHED:

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=519MiB/s][r=133k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=3722: Mon Dec  3 09:57:17 2018
  read: IOPS=78.4k, BW=306MiB/s (321MB/s)(10.0GiB/33444msec)
    clat (nsec): min=1703, max=9042.7k, avg=5463.97, stdev=121068.28
     lat (usec): min=2, max=9043, avg= 6.00, stdev=121.07
    clat percentiles (nsec):
     |  1.00th=[   1848],  5.00th=[   1896], 10.00th=[   1912],
     | 20.00th=[   1960], 30.00th=[   2024], 40.00th=[   2160],
     | 50.00th=[   2384], 60.00th=[   2576], 70.00th=[   2800],
     | 80.00th=[   3120], 90.00th=[   3824], 95.00th=[   4768],
     | 99.00th=[   7968], 99.50th=[  14912], 99.90th=[  50944],
     | 99.95th=[ 667648], 99.99th=[5931008]
   bw (  KiB/s): min= 2768, max=544542, per=100.00%, avg=409912.68, stdev=162333.72, samples=50
   iops        : min=  692, max=136135, avg=102478.08, stdev=40583.47, samples=50
  lat (usec)   : 2=25.93%, 4=65.58%, 10=7.69%, 20=0.57%, 50=0.13%
  lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%
  cpu          : usr=7.20%, sys=92.55%, ctx=396, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=306MiB/s (321MB/s), 306MiB/s-306MiB/s (321MB/s-321MB/s), io=10.0GiB (10.7GB), run=33444-33444msec

UNPATCHED:

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=568MiB/s][r=145k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=640: Mon Dec  3 10:07:38 2018
  read: IOPS=90.4k, BW=353MiB/s (370MB/s)(10.0GiB/29008msec)
    clat (nsec): min=1418, max=12374k, avg=4816.38, stdev=109448.00
     lat (nsec): min=1836, max=12374k, avg=5284.46, stdev=109451.36
    clat percentiles (nsec):
     |  1.00th=[   1576],  5.00th=[   1608], 10.00th=[   1640],
     | 20.00th=[   1672], 30.00th=[   1720], 40.00th=[   1832],
     | 50.00th=[   2096], 60.00th=[   2288], 70.00th=[   2480],
     | 80.00th=[   2736], 90.00th=[   3248], 95.00th=[   3952],
     | 99.00th=[   6368], 99.50th=[  12736], 99.90th=[  43776],
     | 99.95th=[ 798720], 99.99th=[5341184]
   bw (  KiB/s): min=34144, max=606208, per=100.00%, avg=465737.56, stdev=177637.57, samples=45
   iops        : min= 8536, max=151552, avg=116434.33, stdev=44409.46, samples=45
  lat (usec)   : 2=45.74%, 4=49.50%, 10=4.13%, 20=0.45%, 50=0.08%
  lat (usec)   : 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.01%
  cpu          : usr=7.14%, sys=92.39%, ctx=1059, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=353MiB/s (370MB/s), 353MiB/s-353MiB/s (370MB/s-370MB/s), io=10.0GiB (10.7GB), run=29008-29008msec

Clearly both bandwidth and IOPS are worse: 306 vs. 353 MiB/s and 78.4k vs.
90.4k IOPS, roughly a 13% regression.
However, I'm puzzled as to why this is the case, given that I don't see how
this patch affects the submission of readahead I/O.

 fs/btrfs/extent_io.c | 63 +++++++++++++-------------------------------
 1 file changed, 19 insertions(+), 44 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 233f835dd6f8..c097eec1b73d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3059,7 +3059,7 @@ static int __do_readpage(struct extent_io_tree *tree,
 	return ret;
 }
 
-static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
+static inline void contiguous_readpages(struct extent_io_tree *tree,
 					     struct page *pages[],
 					     int nr_pages, u64 start, u64 end,
 					     struct extent_map **em_cached,
@@ -3090,46 +3090,6 @@ static inline void __do_contiguous_readpages(struct extent_io_tree *tree,
 	}
 }
 
-static void __extent_readpages(struct extent_io_tree *tree,
-			       struct page *pages[],
-			       int nr_pages,
-			       struct extent_map **em_cached,
-			       struct bio **bio, unsigned long *bio_flags,
-			       u64 *prev_em_start)
-{
-	u64 start = 0;
-	u64 end = 0;
-	u64 page_start;
-	int index;
-	int first_index = 0;
-
-	for (index = 0; index < nr_pages; index++) {
-		page_start = page_offset(pages[index]);
-		if (!end) {
-			start = page_start;
-			end = start + PAGE_SIZE - 1;
-			first_index = index;
-		} else if (end + 1 == page_start) {
-			end += PAGE_SIZE;
-		} else {
-			__do_contiguous_readpages(tree, &pages[first_index],
-						  index - first_index, start,
-						  end, em_cached,
-						  bio, bio_flags,
-						  prev_em_start);
-			start = page_start;
-			end = start + PAGE_SIZE - 1;
-			first_index = index;
-		}
-	}
-
-	if (end)
-		__do_contiguous_readpages(tree, &pages[first_index],
-					  index - first_index, start,
-					  end, em_cached, bio,
-					  bio_flags, prev_em_start);
-}
-
 static int __extent_read_full_page(struct extent_io_tree *tree,
 				   struct page *page,
 				   get_extent_t *get_extent,
@@ -4104,6 +4064,13 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
 	u64 prev_em_start = (u64)-1;
 
 	while (!list_empty(pages)) {
+		u64 contig_end = 0;
+
+		/*
+		 * Produces a batch of up to 16 contiguous pages, assumes
+		 * that pages are consecutive in pages list (guaranteed by the
+		 * generic code)
+		 **/
 		for (nr = 0; nr < BTRFS_PAGES_BATCH && !list_empty(pages);) {
 			struct page *page = lru_to_page(pages);
 
@@ -4112,14 +4079,22 @@ int extent_readpages(struct address_space *mapping, struct list_head *pages,
 			if (add_to_page_cache_lru(page, mapping, page->index,
 					readahead_gfp_mask(mapping))) {
 				put_page(page);
-				continue;
+				break;
 			}
 
 			pagepool[nr++] = page;
+			contig_end = page_offset(page) + PAGE_SIZE - 1;
 		}
 
-		__extent_readpages(tree, pagepool, nr, &em_cached, &bio,
-				   &bio_flags, &prev_em_start);
+		if (nr) {
+			u64 contig_start = page_offset(pagepool[0]);
+
+			ASSERT(contig_start + (nr*PAGE_SIZE) - 1 == contig_end);
+
+			contiguous_readpages(tree, pagepool, nr, contig_start,
+					contig_end, &em_cached, &bio, &bio_flags,
+					&prev_em_start);
+		}
 	}
 
 	if (em_cached)
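For reference, here is roughly what the batching loop in extent_readpages
looks like once the patch is applied (assembled from the two hunks above;
the unchanged lines that fall between the hunks are elided, so this is a
reading aid, not a substitute for the diff):

	while (!list_empty(pages)) {
		u64 contig_end = 0;

		/*
		 * Produces a batch of up to 16 contiguous pages; pages are
		 * consecutive in the pages list (guaranteed by the generic
		 * code).
		 */
		for (nr = 0; nr < BTRFS_PAGES_BATCH && !list_empty(pages);) {
			struct page *page = lru_to_page(pages);

			/* ... unchanged lines between the hunks elided ... */
			if (add_to_page_cache_lru(page, mapping, page->index,
					readahead_gfp_mask(mapping))) {
				put_page(page);
				break;	/* was "continue" before this patch */
			}

			pagepool[nr++] = page;
			contig_end = page_offset(page) + PAGE_SIZE - 1;
		}

		if (nr) {
			u64 contig_start = page_offset(pagepool[0]);

			/* The batch must be one contiguous range. */
			ASSERT(contig_start + (nr*PAGE_SIZE) - 1 == contig_end);

			contiguous_readpages(tree, pagepool, nr, contig_start,
					contig_end, &em_cached, &bio,
					&bio_flags, &prev_em_start);
		}
	}

One behavioral detail worth noting: on an add_to_page_cache_lru failure the
inner loop now breaks instead of continuing, so a failed page terminates the
current batch early and the next while iteration starts a fresh one; this is
what keeps each batch contiguous by construction.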