md/raid1: round robin disk selection for large sequential reads

Hi Everyone,

I’m sending this out as an RFC with some comments below.  Hope
that’s OK. Note that this is not only my first patch for mdraid
but also the first time I’ve ever even looked at any of the code,
so please keep that in mind :) It would be great if others had a
chance to test on their setups with their favorite disks, or if
you have disk manufacturer/model suggestions please let me know.

Quick background: A colleague of mine ran a short test with
biotop running random reads on a 2 disk array and noticed an
approximate 70/30 distribution of reads between the 2 drives.
He asked me if that was normal and, not knowing this RAID stack,
I said I didn’t know but I’d have a look.  As a quick hack I
threw out pretty much the whole function, forced a 50/50
distribution, and got impressive gains on large sequential IO.
It was enough that I decided to start working on it for real.

This RFC patch adds more complexity to what I think is an already
fairly complex function, but I don’t see a simple refactor and
most of the additions are compartmentalized at the end.  The
concept is simple: round robin the available disks on sequential
reads (it doesn’t work well for non-sequential reads), switching
to the next disk after a specified number of bytes (empirically
determined) have been read from the current disk.  To reduce
testing, the patch only supports round robin selection for 2
mirror sets or less.  More could likely be supported by adjusting
the threshold for moving to the next disk based on the number of
mirrors.
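
To make the concept concrete, here is a minimal userspace sketch of
byte-threshold round robin.  It is not the code from the patch: the
names (`rr_state`, `rr_pick_disk`, `RR_SWITCH_BYTES`) are
hypothetical, the 1 MiB threshold is only a placeholder for the
empirically determined value, and non-sequential reads simply stay
on the current disk here rather than going through the existing
balancing heuristics.

```c
#include <stdint.h>

/* Hypothetical names and values for illustration only. */
#define RR_NUM_DISKS    2                  /* the RFC limits round robin to 2 mirrors */
#define RR_SWITCH_BYTES (1024u * 1024u)    /* placeholder, not the patch's threshold */

struct rr_state {
	int cur_disk;            /* disk currently serving sequential reads */
	uint64_t bytes_on_disk;  /* bytes read from cur_disk since the last switch */
	uint64_t next_seq_sect;  /* sector at which a sequential read would start */
};

/* Pick a disk for a read of `sectors` 512-byte sectors starting at `sector`. */
static int rr_pick_disk(struct rr_state *st, uint64_t sector, uint32_t sectors)
{
	int sequential = (sector == st->next_seq_sect);

	if (!sequential) {
		/* Simplification: restart the byte count, keep the current disk. */
		st->bytes_on_disk = 0;
	} else if (st->bytes_on_disk >= RR_SWITCH_BYTES) {
		/* Threshold reached on a sequential stream: move to the next disk. */
		st->cur_disk = (st->cur_disk + 1) % RR_NUM_DISKS;
		st->bytes_on_disk = 0;
	}

	st->bytes_on_disk += (uint64_t)sectors << 9;  /* sectors to bytes */
	st->next_seq_sect = sector + sectors;
	return st->cur_disk;
}
```

With 512 KiB reads issued back to back, this sketch serves two
reads from one disk and then moves to the other, so each disk sees
long sequential runs instead of alternating on every request.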

Additionally, I think I can optimize the read/write mix cases as
well but didn’t want to hold off any longer on getting the RFC out.

I am also gathering more performance data on other drives.  Here is
what I have available now on Google Photos:

Kioxia CM7: https://photos.app.goo.gl/TpS8wH1rCtnwLVLw9
Intel P5520: https://photos.app.goo.gl/zGBPFT7mJCXdoU8h6

Comments:

* There is a subtle bug fix in here that, if everyone agrees on the
patch direction, I’ll break out into a separate patch.  We used
to set R1BIO_FailFast without really knowing if there is more than
1 available disk.  This was discussed on slack and it’s fairly
clear.  To fix this, there are no more breaks in the loop; we always
loop through all disks, counting available ones and storing the key
parameters that were previously left at their last value as a result
of the break.

* This patch also needs the “loop through all disks” fix above
because we can’t round robin unless we know we have enough disks.

* There’s a potential bug in the sequential condition.  I removed the
entire block of code (the section about choose_next_idle) as it’s no
longer relevant with the round robin logic.  This was discussed a
bit on slack; there was a questionable comparison:
`mirror->next_seq_sect > opt_iosize`

* There’s nothing really tricky going on here.  The one thing that
might not be super obvious is how we round robin: we start on the
current disk and store its index in the mirrors[] array so that we
can find the next one on the next call to read_balance.  We have to
store the index into the array as opposed to `best_disk`, as that can
be at any position in the array depending on the state of the array.
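
The points above can be illustrated with a small userspace sketch.
It is not the raid1.c code: `mirror_slot`, `scan_result` and
`scan_mirrors` are hypothetical names, and the two boolean flags
stand in for the real per-device checks.  It shows a scan that
never breaks out early, counts the available disks before allowing
fail-fast, and tracks the round robin position as a slot index.
As a simplification it resumes from the slot after the last one
used.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one entry of the mirrors[] array. */
struct mirror_slot {
	bool present;   /* a device is attached in this slot */
	bool in_sync;   /* the device can serve this read */
};

struct scan_result {
	int available;     /* usable disks seen during the full scan */
	int chosen_slot;   /* index into mirrors[], or -1 if none is usable */
	bool failfast_ok;  /* true only when more than one disk is usable */
};

/*
 * Visit every slot exactly once, starting after `last_slot`.  There is
 * no early break, so `available` is exact by the time fail-fast is decided.
 */
static struct scan_result scan_mirrors(const struct mirror_slot *mirrors,
				       int nslots, int last_slot)
{
	struct scan_result r = { .available = 0, .chosen_slot = -1,
				 .failfast_ok = false };

	for (int i = 1; i <= nslots; i++) {
		int slot = (last_slot + i) % nslots;

		if (!mirrors[slot].present || !mirrors[slot].in_sync)
			continue;
		r.available++;
		if (r.chosen_slot < 0)
			r.chosen_slot = slot;  /* first usable slot in rotation order */
	}

	r.failfast_ok = r.available > 1;
	return r;
}
```

Passing the previous `chosen_slot` back in as `last_slot` rotates
through the usable slots, skipping missing or out-of-sync ones.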

thanks,
Paul

Signed-off-by: paul luse <paul.e.luse@linux.intel.com>
---
 drivers/md/raid1.c | 107 +++++++++++++++++++++++++++++++--------------
 drivers/md/raid1.h |   6 +++
 2 files changed, 79 insertions(+), 34 deletions(-)

Message ID	20231217074814.122713-1-paul.e.luse@linux.intel.com (mailing list archive)
State	New, archived
From: paul luse <paul.e.luse@linux.intel.com>
To: song@kernel.org
Cc: linux-raid@vger.kernel.org, paul luse <paul.e.luse@linux.intel.com>
Subject: [PATCH] md/raid1: round robin disk selection for large sequential reads
Date: Sat, 16 Dec 2023 23:48:14 -0800
Series	md/raid1: round robin disk selection for large sequential reads

Patch