
btrfs: chunk_width_limit mount option

Message ID 1464918615-6458-1-git-send-email-andrew@asquaredlabs.com (mailing list archive)
State New, archived

Commit Message

Andrew Armenia June 3, 2016, 1:50 a.m. UTC
This patch adds mount option 'chunk_width_limit=X', which when set forces
the chunk allocator to use only up to X devices when allocating a chunk.
This may help reduce the seek penalties seen in filesystems with large
numbers of devices.

Signed-off-by: Andrew Armenia <andrew@asquaredlabs.com>
---
 fs/btrfs/ctree.h   |  3 +++
 fs/btrfs/super.c   | 22 +++++++++++++++++++++-
 fs/btrfs/volumes.c | 26 ++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 1 deletion(-)
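
For illustration only (not part of the patch): a minimal userspace sketch of
enabling the proposed option programmatically via mount(2). The device and
mount point paths are placeholders, and the call only succeeds on a kernel
with this patch applied.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* /dev/sdb and /mnt are placeholder paths; the option string is
	 * passed in the filesystem-specific data argument, exactly as
	 * "-o chunk_width_limit=4" would be on the mount command line. */
	if (mount("/dev/sdb", "/mnt", "btrfs", 0, "chunk_width_limit=4") != 0) {
		perror("mount");
		return 1;
	}
	return 0;
}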

Comments

David Sterba June 6, 2016, 9:17 a.m. UTC | #1
On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote:
> This patch adds mount option 'chunk_width_limit=X', which when set forces
> the chunk allocator to use only up to X devices when allocating a chunk.
> This may help reduce the seek penalties seen in filesystems with large
> numbers of devices.

There is a use for such a limit, but passing it as a mount option is not the
right way to do it. Changing the value requires a remount, different raid
levels might need different values, it's not persistent, etc.
Andrew Armenia June 6, 2016, 1:43 p.m. UTC | #2
On Mon, Jun 6, 2016 at 5:17 AM, David Sterba <dsterba@suse.cz> wrote:
> On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote:
>> This patch adds mount option 'chunk_width_limit=X', which when set forces
>> the chunk allocator to use only up to X devices when allocating a chunk.
>> This may help reduce the seek penalties seen in filesystems with large
>> numbers of devices.
>
> There is a use for such a limit, but passing it as a mount option is not the
> right way to do it. Changing the value requires a remount, different raid
> levels might need different values, it's not persistent, etc.

Thanks for the feedback. I agree that it's less than an ideal way to
do it. I guess I was a bit quick to release proof-of-concept code.
I'll try to do some more work on the configuration side of things and
resubmit.
Hugo Mills June 6, 2016, 3 p.m. UTC | #3
On Mon, Jun 06, 2016 at 09:43:19AM -0400, Andrew Armenia wrote:
> On Mon, Jun 6, 2016 at 5:17 AM, David Sterba <dsterba@suse.cz> wrote:
> > On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote:
> >> This patch adds mount option 'chunk_width_limit=X', which when set forces
> >> the chunk allocator to use only up to X devices when allocating a chunk.
> >> This may help reduce the seek penalties seen in filesystems with large
> >> numbers of devices.
> >
> > There is a use for such a limit, but passing it as a mount option is not the
> > right way to do it. Changing the value requires a remount, different raid
> > levels might need different values, it's not persistent, etc.
> 
> Thanks for the feedback. I agree that it's less than an ideal way to
> do it. I guess I was a bit quick to release proof-of-concept code.
> I'll try to do some more work on the configuration side of things and
> resubmit.

   Last time something like this came up in discussion, the consensus
was to put the configuration in xattrs in the btrfs namespace. There
should be one with a name to indicate "global" configuration, applied
to the top of subvol 5. If the feature allows configuration on a
per-subvol or per-object basis, then there should also be a name for
the relevant xattr (also in the btrfs namespace) that can be created
on each object as required.
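
   As an illustration of that scheme, here is a minimal userspace sketch. The
xattr name btrfs.chunk_width_limit is hypothetical (current kernels do not
accept it), and /mnt stands in for the top of subvol 5.

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
	const char *value = "4";
	char buf[16] = { 0 };

	/* Persist the limit as an xattr in the btrfs namespace on the
	 * filesystem root; the name is hypothetical, shown only to
	 * illustrate the proposed configuration scheme. */
	if (setxattr("/mnt", "btrfs.chunk_width_limit", value,
		     strlen(value), 0) != 0) {
		perror("setxattr");
		return 1;
	}

	/* Read it back the same way a tool (or the kernel, via its own
	 * xattr lookup) would when loading the allocator configuration. */
	if (getxattr("/mnt", "btrfs.chunk_width_limit", buf,
		     sizeof(buf) - 1) < 0) {
		perror("getxattr");
		return 1;
	}
	printf("chunk_width_limit = %s\n", buf);
	return 0;
}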

   Hugo.
Andrew Armenia June 9, 2016, 7:19 p.m. UTC | #4
On Mon, Jun 6, 2016 at 11:00 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Mon, Jun 06, 2016 at 09:43:19AM -0400, Andrew Armenia wrote:
>> On Mon, Jun 6, 2016 at 5:17 AM, David Sterba <dsterba@suse.cz> wrote:
>> > On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote:
>> >> This patch adds mount option 'chunk_width_limit=X', which when set forces
>> >> the chunk allocator to use only up to X devices when allocating a chunk.
>> >> This may help reduce the seek penalties seen in filesystems with large
>> >> numbers of devices.
>> >
>> > There is a use for such a limit, but passing it as a mount option is not the
>> > right way to do it. Changing the value requires a remount, different raid
>> > levels might need different values, it's not persistent, etc.
>>
>> Thanks for the feedback. I agree that it's less than an ideal way to
>> do it. I guess I was a bit quick to release proof-of-concept code.
>> I'll try to do some more work on the configuration side of things and
>> resubmit.
>
>    Last time something like this came up in discussion, the consensus
> was to put the configuration in xattrs in the btrfs namespace. There
> should be one with a name to indicate "global" configuration, applied
> to the top of subvol 5. If the feature allows configuration on a
> per-subvol or per-object basis, then there should also be a name for
> the relevant xattr (also in the btrfs namespace) that can be created
> on each object as required.
>
>    Hugo.
>
> --
> Hugo Mills             | Klytus, I'm bored. What plaything can you offer me
> hugo@... carfax.org.uk | today?
> http://carfax.org.uk/  |
> PGP: E2AB1DE4          |                      Ming the Merciless, Flash Gordon

OK, since I'm new here I'll throw my thoughts out before writing and
submitting another patch. I wouldn't want to have to read the xattrs
from disk every time a chunk needs to be allocated. That means the
allocator configuration will need to be cached in the fs_info
structure. My current line of thinking is to create a
btrfs_allocator_config structure containing all of the allocator
parameters - for now a user-set devs_max for each RAID level. I'd add
a couple ioctls to allow the userspace tools to get and set these
parameters. The "set" ioctl would also persist this configuration to
one or more xattrs on the root of subvol 5 per Hugo's suggestion.

I'm still a bit hung up on the proper way to read the config back at
mount time. Patching the mount routines doesn't seem like a good way
to do it. Perhaps just doing it at the first invocation of
__btrfs_alloc_chunk is best? Any other thoughts?
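
A minimal sketch of what that cached configuration could look like (the
structure, field names, and lazy-load flag are hypothetical, not part of the
posted patch; BTRFS_NR_RAID_TYPES is the existing per-profile count from the
kernel's raid table):

/* Hypothetical: cached in btrfs_fs_info so __btrfs_alloc_chunk never has to
 * read xattrs from disk on the allocation path. */
struct btrfs_allocator_config {
	bool loaded;	/* set once the xattrs have been read back */
	/* user-set stripe-width cap per RAID profile; 0 = unlimited */
	int devs_max[BTRFS_NR_RAID_TYPES];
};

With a structure like this, the first call into __btrfs_alloc_chunk could
check the loaded flag and populate the fields once, avoiding changes to the
mount path.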

-Andrew
Anand Jain June 13, 2016, 2:06 a.m. UTC | #5
On 06/03/2016 09:50 AM, Andrew Armenia wrote:
> This patch adds mount option 'chunk_width_limit=X', which when set forces
> the chunk allocator to use only up to X devices when allocating a chunk.
> This may help reduce the seek penalties seen in filesystems with large
> numbers of devices.

Have you reviewed implementations like device allocation grouping?
Some info is in the btrfs project ideas:

https://btrfs.wiki.kernel.org/index.php/Project_ideas
  Chunk allocation groups
  Limits on number of stripes (stripe width)
  Linear chunk allocation mode

(Device allocation grouping is important for enterprise storage solutions).

Thanks, Anand
Andrew Armenia June 14, 2016, 7:44 p.m. UTC | #6
On Sun, Jun 12, 2016 at 10:06 PM, Anand Jain <anand.jain@oracle.com> wrote:
>
>
> On 06/03/2016 09:50 AM, Andrew Armenia wrote:
>>
>> This patch adds mount option 'chunk_width_limit=X', which when set forces
>> the chunk allocator to use only up to X devices when allocating a chunk.
>> This may help reduce the seek penalties seen in filesystems with large
>> numbers of devices.
>
>
> Have you reviewed implementations like device allocation grouping?
> Some info is in the btrfs project ideas:
>
> https://btrfs.wiki.kernel.org/index.php/Project_ideas
>  Chunk allocation groups
>  Limits on number of stripes (stripe width)
>  Linear chunk allocation mode
>
> (Device allocation grouping is important for enterprise storage solutions).
>
> Thanks, Anand

I have looked at those ideas; allocation groups are what I'm ideally
after but I decided to start out small. I just spotted the dev_group
field in the device tree that appears to be currently unused, so
perhaps I will look at developing a group-aware allocator instead of
just limiting the chunk width.
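
A hypothetical sketch of how a group-aware filter might look in
fs/btrfs/volumes.c, assuming the currently unused on-disk dev_group value
were cached in a new in-memory field (alloc_group below is invented for
illustration):

/*
 * Hypothetical only: accept a device for chunk allocation if it belongs to
 * the requested group; group 0 means no grouping is configured.
 */
static bool btrfs_device_in_group(const struct btrfs_device *device,
				  u32 wanted_group)
{
	return wanted_group == 0 || device->alloc_group == wanted_group;
}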

-Andrew
Hugo Mills June 14, 2016, 7:55 p.m. UTC | #7
On Tue, Jun 14, 2016 at 03:44:47PM -0400, Andrew Armenia wrote:
> On Sun, Jun 12, 2016 at 10:06 PM, Anand Jain <anand.jain@oracle.com> wrote:
> >
> >
> > On 06/03/2016 09:50 AM, Andrew Armenia wrote:
> >>
> >> This patch adds mount option 'chunk_width_limit=X', which when set forces
> >> the chunk allocator to use only up to X devices when allocating a chunk.
> >> This may help reduce the seek penalties seen in filesystems with large
> >> numbers of devices.
> >
> >
> > Have you reviewed implementations like device allocation grouping?
> > Some info is in the btrfs project ideas..
> >
> > https://btrfs.wiki.kernel.org/index.php/Project_ideas
> >  Chunk allocation groups
> >  Limits on number of stripes (stripe width)
> >  Linear chunk allocation mode
> >
> > (Device allocation grouping is important for enterprise storage solutions).
> >
> > Thanks, Anand
> 
> I have looked at those ideas; allocation groups are what I'm ideally
> after but I decided to start out small. I just spotted the dev_group
> field in the device tree that appears to be currently unused, so
> perhaps I will look at developing a group-aware allocator instead of
> just limiting the chunk width.

   I made some design notes on a generalised approach for this a while
ago:

http://www.spinics.net/lists/linux-btrfs/msg33782.html
http://www.spinics.net/lists/linux-btrfs/msg33916.html

   Hugo.

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 101c3cf..27b6f8f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -754,6 +754,9 @@  struct btrfs_fs_info {
 	unsigned long pending_changes;
 	unsigned long compress_type:4;
 	int commit_interval;
+
+	int chunk_width_limit;
+
 	/*
 	 * It is a suggestive number, the read side is safe even it gets a
 	 * wrong number because we will write out the data into a regular
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4e59a91..3da5220 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -300,7 +300,7 @@  enum {
 	Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard,
 	Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow,
 	Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_usebackuproot,
-	Opt_nologreplay, Opt_norecovery,
+	Opt_nologreplay, Opt_norecovery, Opt_width_limit,
 #ifdef CONFIG_BTRFS_DEBUG
 	Opt_fragment_data, Opt_fragment_metadata, Opt_fragment_all,
 #endif
@@ -360,6 +360,7 @@  static const match_table_t tokens = {
 	{Opt_rescan_uuid_tree, "rescan_uuid_tree"},
 	{Opt_fatal_errors, "fatal_errors=%s"},
 	{Opt_commit_interval, "commit=%d"},
+	{Opt_width_limit, "chunk_width_limit=%d"},
 #ifdef CONFIG_BTRFS_DEBUG
 	{Opt_fragment_data, "fragment=data"},
 	{Opt_fragment_metadata, "fragment=metadata"},
@@ -782,6 +783,22 @@  int btrfs_parse_options(struct btrfs_root *root, char *options,
 				info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
 			}
 			break;
+		case Opt_width_limit:
+			intarg = 0;
+			ret = match_int(&args[0], &intarg);
+			if (ret < 0) {
+				btrfs_err(root->fs_info, "invalid chunk width limit");
+				ret = -EINVAL;
+				goto out;
+			}
+
+			if (intarg > 0) {
+				info->chunk_width_limit = intarg;
+			} else {
+				btrfs_info(root->fs_info, "chunk width is unlimited");
+				info->chunk_width_limit = 0;
+			}
+			break;
 #ifdef CONFIG_BTRFS_DEBUG
 		case Opt_fragment_all:
 			btrfs_info(root->fs_info, "fragmenting all space");
@@ -1207,6 +1224,9 @@  static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 	if (info->thread_pool_size !=  min_t(unsigned long,
 					     num_online_cpus() + 2, 8))
 		seq_printf(seq, ",thread_pool=%d", info->thread_pool_size);
+	if (info->chunk_width_limit != 0)
+		seq_printf(seq, ",chunk_width_limit=%d",
+			info->chunk_width_limit);
 	if (btrfs_test_opt(root, COMPRESS)) {
 		if (info->compress_type == BTRFS_COMPRESS_ZLIB)
 			compress_type = "zlib";
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bdc6256..6d0d35d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4558,6 +4558,32 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	devs_increment = btrfs_raid_array[index].devs_increment;
 	ncopies = btrfs_raid_array[index].ncopies;
 
+	/*
+	 * if we have a statically-configured chunk width, and the type doesn't
+	 * specify one, go ahead and use the statically-configured max instead.
+	 *
+	 * If the static value is greater than the BTRFS_MAX_DEVS for the
+	 * chunk tree, we ignore it.
+	 *
+	 * Also, we ignore the static value for system chunks.
+	 */
+	if (
+		devs_max == 0 && info->chunk_width_limit != 0
+		&& !(type & BTRFS_BLOCK_GROUP_SYSTEM)
+		&& info->chunk_width_limit <= BTRFS_MAX_DEVS(info->chunk_root)
+	) {
+		if (info->chunk_width_limit >= devs_min) {
+			devs_max = info->chunk_width_limit;
+		} else {
+			/* warn that the configured width limit is below the minimum */
+			btrfs_warn(info,
+				"can't satisfy max chunk width of %d; "
+				"minimum %d devices needed",
+				info->chunk_width_limit, devs_min
+			);
+		}
+	}
+
 	if (type & BTRFS_BLOCK_GROUP_DATA) {
 		max_stripe_size = SZ_1G;
 		max_chunk_size = 10 * max_stripe_size;