diff mbox

BTRFS: Adds an option to select RAID Stripe size

Message ID 1451305451-31222-1-git-send-email-jpage.lkml@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Sanidhya Solanki Dec. 28, 2015, 12:24 p.m. UTC
An option to select the RAID Stripe size is made
available in the BTRFS Filesystem, via an option
in the BTRFS Config setup, with minimal change
to the existing code base.

Signed-off-by: Sanidhya Solanki <jpage.lkml@gmail.com>
---
 fs/btrfs/Kconfig   | 42 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/Makefile  |  2 ++
 fs/btrfs/scrub.c   |  4 ++++
 fs/btrfs/super.c   |  4 ++++
 fs/btrfs/volumes.c |  2 +-
 fs/btrfs/volumes.h |  4 +++-
 6 files changed, 56 insertions(+), 2 deletions(-)

Comments

Sanidhya Solanki Dec. 28, 2015, 8:38 p.m. UTC | #1
On Mon, 28 Dec 2015 23:19:55 +0100
Christoph Anton Mitterer <calestyo@scientia.net> wrote:

> On Mon, 2015-12-28 at 07:24 -0500, Sanidhya Solanki wrote:
> > An option to select the RAID Stripe size is made
> > available in the BTRFS Filesystem, via an option
> > in the BTRFS Config setup
> Shouldn't that rather eventually become configurable per filesystem?
> 
> Cheers,
> Chris.

Don't know. It was in the BTRFS File todo list, hence the
BTRFS-specific patch.

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sanidhya Solanki Dec. 28, 2015, 9:43 p.m. UTC | #2
On Tue, 29 Dec 2015 02:21:09 +0100
Christoph Anton Mitterer <calestyo@scientia.net> wrote:

> On Mon, 2015-12-28 at 15:38 -0500, Sanidhya Solanki wrote:
> > > Shouldn't that rather eventually become configurable per
> > > filesystem?
> > Don't know. It was in the BTRFS File todo list, hence the
> > BTRFS-specific patch.
> Uhm you misunderstood me =)
> 
> I meant, configurable per instance of a btrfs filesystem, i.e. at e.g.
> (btr)fs-creation time, rather than compiled into the kernel. =)
> 
> Cheers,
> Chris.

Apologies, it appears I did misunderstand. That should be possible,
though slightly complicated. I will try to get that done. But I am
leaving this patch open for acceptance and comments.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer Dec. 28, 2015, 10:19 p.m. UTC | #3
On Mon, 2015-12-28 at 07:24 -0500, Sanidhya Solanki wrote:
> An option to select the RAID Stripe size is made
> available in the BTRFS Filesystem, via an option
> in the BTRFS Config setup
Shouldn't that rather eventually become configurable per filesystem?

Cheers,
Chris.
Sanidhya Solanki Dec. 29, 2015, 12:03 a.m. UTC | #4
On Tue, 29 Dec 2015 04:42:08 +0100
Christoph Anton Mitterer <calestyo@scientia.net> wrote:

> On Mon, 2015-12-28 at 16:43 -0500, Sanidhya Solanki wrote:
> May get even much more complicated, if reshaping (i.e. conversion from
> one chunk size to another) should get implemented as well...

That sounds like an absolutely ghastly idea. Lots of potential for
mistakes and potential data loss. I take up the offer to implement
such a feature. 
Only question is should it be in-place replacement or replace out to
another disk or storage type. Will wait for comments on that question
before implementing. 

> Sure :)
> Being able to set the RAID chunk size makes AFAIU sense for every RAID
> level, depending which IO patterns one has (i.e. even for RAID1).

I do not understand this part. Is the stripe size not independent of
the RAID type? To me it would seem that the filesystem just treats the
RAID type as a means of duplicating (if in the case of a redundant
RAID) or parallelizing data (in case of "R"AID 0), as opposed to the
Stripe length, which is a measure of each logical block for the
filesystem.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer Dec. 29, 2015, 1:21 a.m. UTC | #5
On Mon, 2015-12-28 at 15:38 -0500, Sanidhya Solanki wrote:
> > Shouldn't that rather eventually become configurable per
> > filesystem?
> Don't know. It was in the BTRFS File todo list, hence the
> BTRFS-specific patch.
Uhm you misunderstood me =)

I meant, configurable per instance of a btrfs filesystem, i.e. at e.g.
(btr)fs-creation time, rather than compiled into the kernel. =)

Cheers,
Chris.
Sanidhya Solanki Dec. 29, 2015, 1:31 a.m. UTC | #6
On Tue, 29 Dec 2015 05:26:28 +0100
Christoph Anton Mitterer <calestyo@scientia.net> wrote:

> I spoke largely from the user/admin side,... running a quite big
> storage Tier-2, we did many IO benchmarks over time (with different
> hardware RAID controllers) and also as our IO patterns changed over
> time...
> The result was that our preferred RAID chunk sizes changed over
> time,...

What is your experience like about running a production system on what
is essentially a beta product? Crashes?

Would something like ZFS not be more suited to your environment?
Especially as not all disks will be full, and, if a disk was to fail,
the entire disk would need to be rebuilt from parity drives (as opposed
to ZFS only using the parity data, and not copying empty blocks
(another feature that is planned for BTRFS)). That alone sells me on
ZFS' capabilities over BTRFS.
 
> Being able to to an online conversion (i.e. on the mounted fs) would
> be nice of course (from the sysadmin's side of view) but even if that
> doesn't seem feasible an offline conversion may be useful (one simply
> may not have enough space left elsewhere to move the data of and
> create a new fs with different RAID chunk size from scratch)
> Both open of course many questions (how to deal with crashes, etc.)...
> maybe having a look at how mdadm handles similar problems could be
> worth.

I do not believe it would be possible to guarantee crash or error
recovery when using an in-place rebuild, without slowing down the
entire rebuild to cache each block before replacing it with the new
block. That would slow it down considerably, as you would have to:

copy to cache > checksum > write in place on disk > checksum new data >
verify checksums

I suppose that is the only proper way to do it anyway, but it will
definitely be slow.

Let me know if that is acceptable, and when the developers come online,
they can also input their ideas.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sanidhya Solanki Dec. 29, 2015, 2:23 a.m. UTC | #7
On Tue, 29 Dec 2015 07:03:11 +0100
Christoph Anton Mitterer <calestyo@scientia.net> wrote:

> On Mon, 2015-12-28 at 20:31 -0500, Sanidhya Solanki wrote:
> > What is your experience like about running a production system on
> > what
> > is essentially a beta product? Crashes?
> What do you mean? btrfs? I'm not yet running it in production (there
> was a subthread recently, where I've explained a bit more why).

From Documentation/filesystems/BTRFS.txt:
Btrfs is under heavy development, and is not suitable for
any uses other than benchmarking and review. The Btrfs disk format is
not yet finalized.

> > Would something like ZFS not be more suited to your environment?
> Well I guess that's my personal political decision... I simply think
> that btrfs should and will be the next gen Linux main filesystem.
> Plus that IIRC zfs-fuse is no unmaintained and linux-zfs not yet part
> of Debian rules it anyway out, as I'd be tool lazy to compile it
> myself (at least for work ;) ).

The kernel module "ZFS on Linux" is still actively developed according
to their repo activity (https://github.com/zfsonlinux/zfs). It is done
by a subcontractor for Lawrence Livermore National Laboratory.
Considering they are still in preparation for their Next Generation
Supercomputer (Summit), we can assume that they will keep financing the
development.

> > Especially as not all disks will be full, and, if a disk was to
> > fail, the entire disk would need to be rebuilt from parity drives
> > (as opposed
> > to ZFS only using the parity data, and not copying empty blocks
> > (another feature that is planned for BTRFS))
> Ah? I thought btrfs would already do that as well?

Not yet. Not according to the source code "todo"s atleast.
Thanks for the information about HW RAID controllers.
> I'm not sure what you mean by "cache"... wouldn't btrfs' CoW mean that
> you "just" copy the data, and once this is done, update the metadata
> and things would be either consistent or they would not (and in case
> of a crash still point to the old, not yet reshaped, data)?
> 
> A special case were of course nodatacow'ed data.... there one may need
> some kind of cache or journal... (see the other thread of mine, where
> I ask for checksumming with no-CoWed data =) ).

I just started working recently on BTRFS, I forgot about the CoW part.

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer Dec. 29, 2015, 3:42 a.m. UTC | #8
On Mon, 2015-12-28 at 16:43 -0500, Sanidhya Solanki wrote:
> Apologies, it appears I did misunderstand. That should be possible,
> though slightly complicated. I will try to get that done.
May get even much more complicated, if reshaping (i.e. conversion from
one chunk size to another) should get implemented as well...


> But I am
> leaving this patch open for acceptance and comments.
Sure :)
Being able to set the RAID chunk size makes AFAIU sense for every RAID
level, depending which IO patterns one has (i.e. even for RAID1).

Cheers,
Chris.
Christoph Anton Mitterer Dec. 29, 2015, 4:26 a.m. UTC | #9
On Mon, 2015-12-28 at 19:03 -0500, Sanidhya Solanki wrote:
> That sounds like an absolutely ghastly idea.
*G* and it probably is ;)


>  Lots of potential for
> mistakes and potential data loss. I take up the offer to implement
> such a feature. 
> Only question is should it be in-place replacement or replace out to
> another disk or storage type. Will wait for comments on that question
> before implementing. 
I guess you really should have a decent discussion with some of the
core btrfs developers (which I am not) before doing any efforts on this
(and possibly wasting great amounts of work).

I spoke largely from the user/admin side,... running a quite big
storage Tier-2, we did many IO benchmarks over time (with different
hardware RAID controllers) and also as our IO patterns changed over
time...
The result was that our preferred RAID chunk sizes changed over
time,...

Being able to to an online conversion (i.e. on the mounted fs) would be
nice of course (from the sysadmin's side of view) but even if that
doesn't seem feasible an offline conversion may be useful (one simply
may not have enough space left elsewhere to move the data of and create
a new fs with different RAID chunk size from scratch)
Both open of course many questions (how to deal with crashes, etc.)...
maybe having a look at how mdadm handles similar problems could be
worth.


> I do not understand this part.
I just wanted to say that being able to set the RAID chunk size makes
sense for any RAID level,... not just for e.g. the parity RAIDs.

>  Is the stripe size not independent of
> the RAID type?
Perhaps one should first do some normative definitions...
Based on e.g. https://en.wikipedia.org/wiki/File:RAID_6.svg I'd use:

RAID chunk size:
the size of A1 which equals the size of A2, and also of B2 and so on
IIRC, e2fsprogs call this the stride size (with a "d" not a "p")

RAID stripe size:
I personally would typically understand this to be the size of A1A2A3
together (which is equal to the one of B1B2B3 and so on).
But I've seen literature, software, people using it as a synonym for
the RAID chunk size  and some consider the stripe size to be also
include the parity, which in the RAID6 example would make it the length
of A1A2A3ApAq.
So I usually try to use "RAID data stripe size" or "RAID data+paritiy
stripe size"


btrfs chunk size = size of a btrfs data or meta-data chunk


Cheers,
Chris.
Christoph Anton Mitterer Dec. 29, 2015, 6:03 a.m. UTC | #10
On Mon, 2015-12-28 at 20:31 -0500, Sanidhya Solanki wrote:
> What is your experience like about running a production system on
> what
> is essentially a beta product? Crashes?
What do you mean? btrfs? I'm not yet running it in production (there
was a subthread recently, where I've explained a bit more why).

But right now I re-organise a lot of our data-pools and consider to set
up those, which serve just as replica holders, with btrfs.
But the RAID would likely be still from the HW controller.


> Would something like ZFS not be more suited to your environment?
Well I guess that's my personal political decision... I simply think
that btrfs should and will be the next gen Linux main filesystem.
Plus that IIRC zfs-fuse is no unmaintained and linux-zfs not yet part
of Debian rules it anyway out, as I'd be tool lazy to compile it myself
(at least for work ;) ).


> Especially as not all disks will be full, and, if a disk was to fail,
> the entire disk would need to be rebuilt from parity drives (as
> opposed
> to ZFS only using the parity data, and not copying empty blocks
> (another feature that is planned for BTRFS))
Ah? I thought btrfs would already do that as well?

Well anyway... I did some comparison between HW RAID and MD RAID each
with ext4 and btrfs.
I haven't tried btrfs-RAID6 back then, since it's IMHO still too far
away from being production ready.

IIRC, there were be some (for us) interesting cases where MD RAID would
have been somewhat faster than HW RAID,... but there are some other
major IO patters (IIRC sequential read/write) where HW RAID was simply
magnitudes faster (no big surprise of course).


> I do not believe it would be possible to guarantee crash or error
> recovery when using an in-place rebuild, without slowing down the
> entire rebuild to cache each block before replacing it with the new
> block. That would slow it down considerably, as you would have to:
> 
> copy to cache > checksum > write in place on disk > checksum new data
> >
> verify checksums
I'm not sure what you mean by "cache"... wouldn't btrfs' CoW mean that
you "just" copy the data, and once this is done, update the metadata
and things would be either consistent or they would not (and in case of
a crash still point to the old, not yet reshaped, data)?

A special case were of course nodatacow'ed data.... there one may need
some kind of cache or journal... (see the other thread of mine, where I
ask for checksumming with no-CoWed data =) ).


> I suppose that is the only proper way to do it anyway, but it will
> definitely be slow.
From my PoV... slowness doesn't matter *that* much here anyway, while
consitency/safety does.
I mean reshaping a RAID wouldn't be something you'd do every month (at
least not in production systems - test systems are of course another
case).
Once I'd have determined that another RAID chunk size would perform
*considerably* better than the current, I'd reshape... and whether that
runs than for a week or two... as long as it happens online and as long
as I can control a bit how much IO is spent on the reshape: who cares?


Cheers,
Chris.
Sanidhya Solanki Dec. 29, 2015, 11:15 a.m. UTC | #11
On Tue, 29 Dec 2015 14:39:07 +0100
David Sterba <dsterba@suse.cz> wrote:

> The stripe size depends on how the filesystem was made, at the moment
> the stripesize parameter is missing from mkfs. The kernel module
> should support all sizes at runtime, so it's not a compile-time
> option.

No good? I will try and re-implement it as a runtime option.

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Dec. 29, 2015, 1:39 p.m. UTC | #12
On Mon, Dec 28, 2015 at 07:24:11AM -0500, Sanidhya Solanki wrote:
> An option to select the RAID Stripe size is made
> available in the BTRFS Filesystem, via an option
> in the BTRFS Config setup, with minimal change
> to the existing code base.

The stripe size depends on how the filesystem was made, at the moment
the stripesize parameter is missing from mkfs. The kernel module should
support all sizes at runtime, so it's not a compile-time option.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer Dec. 29, 2015, 3:32 p.m. UTC | #13
On Mon, 2015-12-28 at 21:23 -0500, Sanidhya Solanki wrote:
> From Documentation/filesystems/BTRFS.txt:
> Btrfs is under heavy development, and is not suitable for
> any uses other than benchmarking and review.
Well I guess now it's time for Duncan's usual "stable or not" talk
(@Duncan, I think by now, you should have made it into some verse or
ballad form... :D for general pleasure ;) )


> The kernel module "ZFS on Linux" is still actively developed
> according
> to their repo activity (https://github.com/zfsonlinux/zfs). It is don
> e
> by a subcontractor for Lawrence Livermore National Laboratory.
> Considering they are still in preparation for their Next Generation
> Supercomputer (Summit), we can assume that they will keep financing
> the
> development.
Sure, I know,... zfs-fuse (i.e. the fuse and not the kernel module
version) is AFAIC, unmaintained (or soon to be).


> I just started working recently on BTRFS, I forgot about the CoW
> part.
No worries.
As I've said,... you probably should have some deeper discussions with
the core developers on how they think variable chunk size support and
any reshaping should be implemented.
Otherwise you may just put a lot of effort into something, that may
conflict some other longer term design ideas.


Cheers,
Chris.
Duncan Dec. 29, 2015, 4:44 p.m. UTC | #14
Christoph Anton Mitterer posted on Tue, 29 Dec 2015 16:32:59 +0100 as
excerpted:

>> From Documentation/filesystems/BTRFS.txt:
>> Btrfs is under heavy development, and is not suitable for any uses
>> other than benchmarking and review.

> Well I guess now it's time for Duncan's usual "stable or not" talk
> (@Duncan, I think by now, you should have made it into some verse or
> ballad form... :D for general pleasure ;) )

=:^)

The devs did remove most of the experimental warnings some versions ago.  
I guess they missed that one.  The "heavy development" part is definitely 
still correct, but with the caveats below, I don't believe the "only 
benchmarking and review" fits the generally held list position, these 
days.

As I normally put it, btrfs is "definitely stabilizING, but not yet 
entirely stable and mature."  What that means in real life is that while 
not yet recommended for production use where down time costs money and 
potentially jobs, it's generally ready for routine daily use, PROVIDED 
one observes the usual admin's rule of backups, that for any level of 
backup, either you have it, or you consider the data it would cover to be 
worth less than the hassle and resources that backup would take, modified 
by the risk factor of actually having to use the backup.  Because btrfs 
is still stabilizing, that risk factor remains somewhat elevated, so you 
better have at least 1-2 levels of backup if you don't consider the data 
of trivial throw-away value.

Beyond that, keeping up with the list and staying relatively current with 
your kernel and btrfs-progs userspace are strongly recommended as well.

But it's definitely not recommended yet for the conservative stability 
types that run half a decade old "stable" enterprise distros.  That sort 
of use-case is in basic conflict with where btrfs is at this point, and 
people wishing to run it should be looking to some other filesystem, or 
if they /do/ choose to take up the enterprise distros on their offer of 
support for btrfs, should be looking to them for that support, as it's 
them that are choosing to offer it, while the general position on the 
list seems to be "that's insane".

Similarly of course for those unwilling to do backups or run relatively 
current kernels and btrfs userspace, or to keep up with the list.  In 
that case, btrfs isn't an appropriate choice for them.

But while not entirely stable and mature yet, that's still rather beyond 
"not suitable for any uses other than benchmarking and review", and 
indeed, most of the other wording of that nature was stripped around 
kernel 3.12, and while some of us considered that a bit early, I think 
most would agree that the "benchmarking and review" only wording is 
somewhat dated, by now.
David Sterba Dec. 29, 2015, 5:06 p.m. UTC | #15
On Tue, Dec 29, 2015 at 06:15:12AM -0500, Sanidhya Solanki wrote:
> On Tue, 29 Dec 2015 14:39:07 +0100
> David Sterba <dsterba@suse.cz> wrote:
> 
> > The stripe size depends on how the filesystem was made, at the moment
> > the stripesize parameter is missing from mkfs. The kernel module
> > should support all sizes at runtime, so it's not a compile-time
> > option.
> 
> No good? I will try and re-implement it as a runtime option.

So you want to make the stripe size configurable? The stripesize is
sotred in the superblock but as the hardcoded value is 64k through the
BTRFS_STRIPE_LEN define, the superblock value is not honored in the
code. I don't know about all implications from changing the define to
sb->stripesize, also we want to define the allowed range etc. It would
be better to add more description to the patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Dec. 29, 2015, 6:06 p.m. UTC | #16
On Tue, Dec 29, 2015 at 05:26:28AM +0100, Christoph Anton Mitterer wrote:
> On Mon, 2015-12-28 at 19:03 -0500, Sanidhya Solanki wrote:
> > That sounds like an absolutely ghastly idea.
> *G* and it probably is ;)
> 
> 
> >  Lots of potential for
> > mistakes and potential data loss. I take up the offer to implement
> > such a feature. 
> > Only question is should it be in-place replacement or replace out to
> > another disk or storage type. Will wait for comments on that question
> > before implementing. 
> I guess you really should have a decent discussion with some of the
> core btrfs developers (which I am not) before doing any efforts on this
> (and possibly wasting great amounts of work).
> 
> I spoke largely from the user/admin side,... running a quite big
> storage Tier-2, we did many IO benchmarks over time (with different
> hardware RAID controllers) and also as our IO patterns changed over
> time...
> The result was that our preferred RAID chunk sizes changed over
> time,...
> 
> Being able to to an online conversion (i.e. on the mounted fs) would be
> nice of course (from the sysadmin's side of view)

In theory this is possible with current on-disk data structures. The
stripe length is property of btrfs_chunk and changing it should be
possible the same way we do other raid transformations. The
implementation might be tricky at some places, but basically boils down
to the "read-" and "write-" stripe size. Reading chunks would always
respect the stored size, writing new data would use eg. the
superblock->stripesize or other value provided by the user.

> but even if that
> doesn't seem feasible an offline conversion may be useful (one simply
> may not have enough space left elsewhere to move the data of and create
> a new fs with different RAID chunk size from scratch)

Currently the userspace tools are not capable of the balance/relocation
functionality equivalent.

> Both open of course many questions (how to deal with crashes, etc.)...
> maybe having a look at how mdadm handles similar problems could be
> worth.

The crash consistency should remain, other than that we'd have to
enhance the balance filters to process only the unconverted chunks to
continue.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sanidhya Solanki Dec. 29, 2015, 9:32 p.m. UTC | #17
On Tue, 29 Dec 2015 18:06:11 +0100
David Sterba <dsterba@suse.cz> wrote:

> I don't know about all implications from changing the define to
> sb->stripesize, also we want to define the allowed range etc. It would
> be better to add more description to the patch.

So, is the patch atleast somewhat usable and you just need me to expand
the description, or should I write the whole patch from scratch,
modifying superblock size as a runtime configurable option?

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer Dec. 30, 2015, 2:56 a.m. UTC | #18
On Tue, 2015-12-29 at 16:44 +0000, Duncan wrote:
> As I normally put it, btrfs is "definitely stabilizING, but not yet 
> entirely stable and mature."
[snip snap]

And now... as a song please :D

I'd also take a medieval rhyme ;-)


Cheers,
Chris.
Sanidhya Solanki Dec. 30, 2015, 6:39 a.m. UTC | #19
On Tue, 29 Dec 2015 18:06:11 +0100
David Sterba <dsterba@suse.cz> wrote:

> So you want to make the stripe size configurable?...

As I see it there are 3 ways to do it:
-Make it a compile time option that only configures it for a single
system with any devices that are added to the RAID.
-Make it a runtime option that can change based on how the
administrator configures it.
-A non-user facing option that is configurable by someone like a
distribution maintainer for all systems using the Binary Distribution.

As I see it, DS would like something like the third option, but CAM
(ostensibly a SysAdmin) wants the second option.

On the other hand, I implemented the first option. 

The first and third option can co-exit, the second is an orthogonal
target that needs to be setup separately.

Or we can make all options co-exist, but make it more complicated.

Please let me know which implementation is preferable, and, if you just
want me to expand the description (as DS' mail asked for) or redo the
entire setup.

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sanidhya Solanki Dec. 30, 2015, 9:54 a.m. UTC | #20
On Wed, 30 Dec 2015 19:59:16 +0800
Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> Not really sure about the difference between 2 and 3.

I should have made it clear before, I was asking the exact use case in
mind when listing the choices. Option 2 would be for SysAdmins running
production software and configuring it as they desire.
Option 3 is what we have in the Kernel now, before my patch, where the
option exists, but it is fixed by the code. You can change it, but you
need to be someone fairly involved in the upstream work (like a
distribution Maintainer). This is what my patch implements (well, this
and option 3).
Option 1 leaves it as a compile time option.

> When you mention runtime option, did you mean ioctl/mount/balance 
> convert option?

Yes, that is correct.

> And what's the third one? Default mkfs time option?
> If you can make it mkfs time option, it won't be really hard to make
> it configurable.

This would be ideal for all use-cases, but make the implementation
much larger than it would be for the other options. Hence, I asked
what the exact use case was for the end-user being targeted.
 
> I didn't consider David means something that.
> As far as I read, he means balance convert option along with mkfs
> option.

Hence, why I asked.

> At least from what I have learned in recent btrfs development,
He> He> either
> we provide a good enough interfaces (normally, balance convert ioctl
> with mkfs time option) to configure some on-disk fields.

Just confirming before starting the implementation.
> So fixed kernel value is not a really good idea, and should at least
> be replace by mkfs time option.

Will do after confirmation.

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sanidhya Solanki Dec. 30, 2015, 11:15 a.m. UTC | #21
On Wed, 30 Dec 2015 22:10:44 +0800
Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> Understood now.

Good.

> I totally understand that implement ... to polish your
> skill.

That has got to be the most hilarious way I believe I have seen someone
delegate a task. But it was effective.

Only one problem. I do not run BTRFS on my systems nor do I have a
RAID setup, due to possessing a limited number of free drives. So, while
I may be able to code for it, I will not be able to test it. I will need
the community's help to do the testing.

I will get started tomorrow.

To-do (so far):
- Implement RAID Stripe length as a compile and runtime option.
- Implement a way to do an in-place Stripe Length change.
- Debugging & testing for the above additions.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Dec. 30, 2015, 11:59 a.m. UTC | #22
On 12/30/2015 02:39 PM, Sanidhya Solanki wrote:
> On Tue, 29 Dec 2015 18:06:11 +0100
> David Sterba <dsterba@suse.cz> wrote:
>
>> So you want to make the stripe size configurable?...
>
> As I see it there are 3 ways to do it:
> -Make it a compile time option that only configures it for a single
> system with any devices that are added to the RAID.
> -Make it a runtime option that can change based on how the
> administrator configures it.
> -A non-user facing option that is configurable by someone like a
> distribution maintainer for all systems using the Binary Distribution.

Not really sure about the difference between 2 and 3.

When you mention runtime option, did you mean ioctl/mount/balance 
convert option?

And what's the third one? Default mkfs time option?

If you can make it mkfs time option, it won't be really hard to make it 
configurable.

>
> As I see it, DS would like something like the third option, but CAM
> (ostensibly a SysAdmin) wants the second option.

I didn't consider David means something that.

As far as I read, he means balance convert option along with mkfs option.

>
> On the other hand, I implemented the first option.

At least from what I have learned in recent btrfs development, either we 
provide a good enough interfaces (normally, balance convert ioctl with 
mkfs time option) to configure some on-disk fields.

Or we just leave it to fixed value(normally 0, just like for encryption 
of EXTENT_DATA, and that's the case for current stripe_size).

So fixed kernel value is not a really good idea, and should at least be 
replace by mkfs time option.

>
> The first and third option can co-exit, the second is an orthogonal
> target that needs to be setup separately.
>
> Or we can make all options co-exist, but make it more complicated.

No need.
Just refer to how btrfs kernel handle chunk profile.

It can be specified at mkfs time (by -d and -m options), and can also be 
converted later by balance ioctl. (by btrfs balance convert filter).

The only tricky thing I am a little considered about is, how do we keep 
the default chunk stripe size for a fs.

Thanks,
Qu
>
> Please let me know which implementation is preferable, and, if you just
> want me to expand the description (as DS' mail asked for) or redo the
> entire setup.
>
> Thanks
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Dec. 30, 2015, 2:10 p.m. UTC | #23
On 12/30/2015 05:54 PM, Sanidhya Solanki wrote:
> On Wed, 30 Dec 2015 19:59:16 +0800
> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> Not really sure about the difference between 2 and 3.
>
> I should have made it clear before, I was asking the exact use case in
> mind when listing the choices. Option 2 would be for SysAdmins running
> production software and configuring it as they desire.
> Option 3 is what we have in the Kernel now, before my patch, where the
> option exists, but it is fixed by the code. You can change it, but you
> need to be someone fairly involved in the upstream work (like a
> distribution Maintainer). This is what my patch implements (well, this
> and option 3).
> Option 1 leaves it as a compile time option.
>
>> When you mention runtime option, did you mean ioctl/mount/balance
>> convert option?
>
> Yes, that is correct.
>
>> And what's the third one? Default mkfs time option?
>> If you can make it mkfs time option, it won't be really hard to make
>> it configurable.
>
> This would be ideal for all use-cases, but make the implementation
> much larger than it would be for the other options. Hence, I asked
> what the exact use case was for the end-user being targeted.
>
>> I didn't consider David means something that.
>> As far as I read, he means balance convert option along with mkfs
>> option.
>
> Hence, why I asked.
>
>> At least from what I have learned in recent btrfs development,
> He> He> either
>> we provide a good enough interfaces (normally, balance convert ioctl
>> with mkfs time option) to configure some on-disk fields.
>
> Just confirming before starting the implementation.
>> So fixed kernel value is not a really good idea, and should at least
>> be replace by mkfs time option.
>
> Will do after confirmation.

Understood now.

Now I am on the same side of David.
Which means a runtime interface to change them. (along with mkfs option)

If provide some configurable features, then it should be able to be 
tuned at both right time and mkfs time.
Or, just don't touch it until there is really enough user demand.
(In stripe_len case, it's also a possible choice, as configurable stripe 
length doesn't really affect much except RAID5/6)


I totally understand that implement will cost you a lot of more time, 
not only kernel part but also user-tool part.

But this also means more patches.
No matter what the motivation for you to contribute to btrfs, more 
patches (except the more time spent) are always good.

More patches, more reputation built in community, and more patches also 
means better split code structures for easier review.
And also you will need to do more debugging/tests, to polish your skill.

Thanks,
Qu

>
> Thanks
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Dec. 30, 2015, 3:58 p.m. UTC | #24
On Wed, Dec 30, 2015 at 06:15:23AM -0500, Sanidhya Solanki wrote:
> Only one problem. I do not run BTRFS on my systems nor do I have a
> RAID setup, due to possessing a limited number of free drives. So, while
> I may be able to code for it, I will not be able to test it. I will need
> the community's help to do the testing.

Multiple devices can be simulated by loop devices or one physical device
partitioned. I'd expect at least some testing on your side, the
community will help with testing, but that's nothing specific to this
patch. This happens all the time.

> I will get started tomorrow.
> 
> To-do (so far):
> - Implement RAID Stripe length as a compile and runtime option.

I was trying to explain that it's not a compile time option.

> - Implement a way to do an in-place Stripe Length change.

How are you going to implement that? I've suggested the balance filter
style of conversion, which is not in-place so I'm curious what do you
mean by in-place.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Dec. 30, 2015, 4:17 p.m. UTC | #25
On Wed, Dec 30, 2015 at 10:10:44PM +0800, Qu Wenruo wrote:
> Now I am on the same side of David.
> Which means a runtime interface to change them. (along with mkfs option)
> 
> If provide some configurable features, then it should be able to be 
> tuned at both right time and mkfs time.
> Or, just don't touch it until there is really enough user demand.
> (In stripe_len case, it's also a possible choice, as configurable stripe 
> length doesn't really affect much except RAID5/6)

I think that we need configurable stripe size regardless. The
performance drop is measurable if the stripe size used by filesystem
does not match the hardware.

> I totally understand that implement will cost you a lot of more time, 
> not only kernel part but also user-tool part.
> 
> But this also means more patches.
> No matter what the motivation for you to contribute to btrfs, more 
> patches (except the more time spent) are always good.
> 
> More patches, more reputation built in community, and more patches also 
> means better split code structures for easier review.

Let me note that a good reputation is also built from patch reviews
(hint hint).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer Dec. 30, 2015, 7:48 p.m. UTC | #26
On Wed, 2015-12-30 at 22:10 +0800, Qu Wenruo wrote:
> Or, just don't touch it until there is really enough user demand.
I definitely think that there is demand... as I've written previously,
when I did some benchmarking tests (though on MD and HW raid) then
depending on the RAID chunk size, one got different kind of IO patterns
tuned.

> (In stripe_len case, it's also a possible choice, as configurable
> stripe 
> length doesn't really affect much except RAID5/6)
Sure about that? Admittedly I haven't checked it for the not parity
RAIDs, but I'd expect that for the same reasons you get out different
performance for sequential/random/vector reads/writes, you'd also get
that more or less at least at RAID0.

Cheers,
Chris.
Christoph Anton Mitterer Dec. 30, 2015, 8 p.m. UTC | #27
On Tue, 2015-12-29 at 19:06 +0100, David Sterba wrote:
> > Both open of course many questions (how to deal with crashes,
> > etc.)...
> > maybe having a look at how mdadm handles similar problems could be
> > worth.
> 
> The crash consistency should remain, other than that we'd have to
> enhance the balance filters to process only the unconverted chunks to
> continue.

What about nodatacow'ed files? I'd expect that in case of a crash
during reshaping, these files are (likely) garbage then right?
Not particularly desirable...


But probably that just goes in the direction of the issues/questions I
brought up in the other thread where I've asked the devs for
possibilities in terms of checksumming on nodatacowed areas:

i.e. stability/integrity of such files


For me, speaking with the sysadmin's hat, that was always the main
reason not to do reshapes so far, especially when data is quite
precious, which at least one typical use case for nodatacow is, namely
DBs.

So having crash resistance for CoW + nodataCoW during RAID reshape
should be desired.

Time for a journal in btrfs? O;-)


Cheers,
Chris.
Duncan Dec. 30, 2015, 9:02 p.m. UTC | #28
Christoph Anton Mitterer posted on Wed, 30 Dec 2015 21:00:11 +0100 as
excerpted:

> On Tue, 2015-12-29 at 19:06 +0100, David Sterba wrote:
>> > Both open of course many questions (how to deal with crashes,
>> > etc.)...
>> > maybe having a look at how mdadm handles similar problems could be
>> > worth.
>> 
>> The crash consistency should remain, other than that we'd have to
>> enhance the balance filters to process only the unconverted chunks to
>> continue.
> 
> What about nodatacow'ed files? I'd expect that in case of a crash during
> reshaping, these files are (likely) garbage then right?
> Not particularly desirable...

For something like that, it'd pretty much /have/ to be done as COW, at 
least at the chunk level, tho the address from the outside may stay the 
same.  That's what balance already does, after all.
Christoph Anton Mitterer Dec. 30, 2015, 9:13 p.m. UTC | #29
On Wed, 2015-12-30 at 21:02 +0000, Duncan wrote:
> For something like that, it'd pretty much /have/ to be done as COW,
> at 
> least at the chunk level, tho the address from the outside may stay
> the 
> same.  That's what balance already does, after all.
Ah... of course,... it would be basically CoW1 again... sorry for not
having thought about that :-)

Sounds like a good and stable solution then.

Cheers,
Chris.
Sanidhya Solanki Dec. 30, 2015, 9:19 p.m. UTC | #30
On Wed, 30 Dec 2015 16:58:05 +0100
David Sterba <dsterba@suse.cz> wrote:

> On Wed, Dec 30, 2015 at 06:15:23AM -0500, Sanidhya Solanki wrote:

> > - Implement a way to do an in-place Stripe Length change.
>
> How are you going to implement that? I've suggested the balance filter
> style of conversion, which is not in-place so I'm curious what do you
> mean by in-place.

As CAM suggested, it would basically be a CoW, with a checksum
comparison at the end to make sure no data has been corrupted.

In-place: Without taking the drives or filesystem offline or unmounting
them. Doing the conversion while the rest of the RAID is in use.
Risky, slow, but possible, given enough time for large data sets.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sanidhya Solanki Dec. 30, 2015, 9:21 p.m. UTC | #31
On Wed, 30 Dec 2015 17:17:22 +0100
David Sterba <dsterba@suse.cz> wrote:

> Let me note that a good reputation is also built from patch reviews
> (hint hint).

Unfortunately, not too many patches coming in for BTRFS presently.
Mailing list activity is down to 25-35 mails per day. Mostly feature
and bug requests.

I will try to pitch in with patch reviews where possible.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Dec. 31, 2015, 12:46 a.m. UTC | #32
David Sterba wrote on 2015/12/30 17:17 +0100:
> On Wed, Dec 30, 2015 at 10:10:44PM +0800, Qu Wenruo wrote:
>> Now I am on the same side of David.
>> Which means a runtime interface to change them. (along with mkfs option)
>>
>> If provide some configurable features, then it should be able to be
>> tuned at both right time and mkfs time.
>> Or, just don't touch it until there is really enough user demand.
>> (In stripe_len case, it's also a possible choice, as configurable stripe
>> length doesn't really affect much except RAID5/6)
>
> I think that we need configurable stripe size regardless. The
> performance drop is measurable if the stripe size used by filesystem
> does not match the hardware.

Right, I just missed the benchmark from Christoph and forgot the case of 
RAID 5/6.

>
>> I totally understand that implement will cost you a lot of more time,
>> not only kernel part but also user-tool part.
>>
>> But this also means more patches.
>> No matter what the motivation for you to contribute to btrfs, more
>> patches (except the more time spent) are always good.
>>
>> More patches, more reputation built in community, and more patches also
>> means better split code structures for easier review.
>
> Let me note that a good reputation is also built from patch reviews
> (hint hint).

I must admit I'm a bad reviewer.
As when I review something, I always has an eager to rewrite part or all 
the patch to follow my idea, even it's just a choice between different 
design.

Thanks,
Qu


> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sanidhya Solanki Jan. 2, 2016, 11:52 a.m. UTC | #33
On Tue, 29 Dec 2015 19:06:44 +0100
David Sterba <dsterba@suse.cz> wrote:

> In theory this is possible with current on-disk data structures. The
> stripe length is property of btrfs_chunk and changing it should be
> possible the same way we do other raid transformations. The
> implementation might be tricky at some places, but basically boils
> down to the "read-" and "write-" stripe size. Reading chunks would
> always respect the stored size, writing new data would use eg. the
> superblock->stripesize or other value provided by the user.

I was having misgivings about the conversion project, but after
re-reading this part, I will try and get a patch in by Wednesday.

I still have my reservations about the following two parts:
- Checksumming: I have no experience with how the CRC implementation
  would deal with the changed blocksizes. Would the checksum be
  different just because the superblock size has been changed? This
  would make confirming if the transformation was successful much more
  difficult. Another way to deal with this would be ti read the data
  instead and compare it directly, instead of using checksums.

- Performance: Should it have a higher throughput by using larger data
  sizes (which may reduce performance in scenarios such as databases and
  video editing) or by having multiple transformations in parallel on
  smaller data blocks. I am not sure if you can implement things such
  as OpenMP in kernel space. Or spawn multiple kworkers in parallel to
  deal with multiple streams of data.

I am not too worried about dealing with crashes, as we can just
implement something like a table that contains the addresses currently
undergoing changes (which may further reduce throughput, but make it
more space) or do it by using a serial transformation, which ensures a
block was committed to storage before proceeding to the next
transformation.

Essentially a time vs. CPU usage vs. Memory usage trade-off.
Please chime in with your thoughts, developers and administrators.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Jan. 3, 2016, 1:37 a.m. UTC | #34
On 01/02/2016 07:52 PM, Sanidhya Solanki wrote:
> On Tue, 29 Dec 2015 19:06:44 +0100
> David Sterba <dsterba@suse.cz> wrote:
>
>> In theory this is possible with current on-disk data structures. The
>> stripe length is property of btrfs_chunk and changing it should be
>> possible the same way we do other raid transformations. The
>> implementation might be tricky at some places, but basically boils
>> down to the "read-" and "write-" stripe size. Reading chunks would
>> always respect the stored size, writing new data would use eg. the
>> superblock->stripesize or other value provided by the user.
>
> I was having misgivings about the conversion project, but after
> re-reading this part, I will try and get a patch in by Wednesday.
>
> I still have my reservations about the following two parts:
> - Checksumming: I have no experience with how the CRC implementation
>    would deal with the changed blocksizes. Would the checksum be
>    different just because the superblock size has been changed? This
>    would make confirming if the transformation was successful much more
>    difficult. Another way to deal with this would be ti read the data
>    instead and compare it directly, instead of using checksums.

Btrfs checksum are calculated in 3 different method:
1) Metadata: Per nodesize, stored in tree blocker header. (struct 
btrfs_header->csum)
2) Data: Per sectorsize, stored in csum tree.
3) Superblock: Per 4K (fixed), stored in its header (struct 
btrfs_super->csum)

I didn't the need to change any of them, as you are not changing any of 
the csum behavior.

Stripe size only affect how btrfs does IO, not the csum size.

>
> - Performance: Should it have a higher throughput by using larger data
>    sizes (which may reduce performance in scenarios such as databases and
>    video editing) or by having multiple transformations in parallel on
>    smaller data blocks. I am not sure if you can implement things such
>    as OpenMP in kernel space. Or spawn multiple kworkers in parallel to
>    deal with multiple streams of data.

IIRC, btrfs only need to pass bio to devices, and the 
parallel/merge/schedule are all done by kernel bio level.
So you don't really need to bother that much.

And since you are making the stripe size configurable, then user is 
responsible for any too large or too small stripe size setting.

Your only concern would be the default value, but IMHO current 64K 
stripe size is good enough as a default value.

Thanks,
Qu

>
> I am not too worried about dealing with crashes, as we can just
> implement something like a table that contains the addresses currently
> undergoing changes (which may further reduce throughput, but make it
> more space) or do it by using a serial transformation, which ensures a
> block was committed to storage before proceeding to the next
> transformation.
>
> Essentially a time vs. CPU usage vs. Memory usage trade-off.
> Please chime in with your thoughts, developers and administrators.
>
> Thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer Jan. 3, 2016, 2:26 a.m. UTC | #35
On Sun, 2016-01-03 at 09:37 +0800, Qu Wenruo wrote:
> And since you are making the stripe size configurable, then user is 
> responsible for any too large or too small stripe size setting.
That pops up the questions, which raid chunk sizes the kernel,
respectively the userland tools should allow for btrfs...

I'd guess only powers of 2, some minimum, some maximum.

Are there any concerns/constraints with too small/too big chunks when
these play together with lower block layers (I'd guess not).

Can one use any device topology information from lower block layers and
make according warnings/suggestions?


> Your only concern would be the default value, but IMHO current 64K 
> stripe size is good enough as a default value.
IIRC mdadm's default was 512K...
Also in my benchmarks the observation was rather that for most IO
patterns, higher chunk sizes perform better (that is at least in MD
RAID and especially HW RAID).
For all our HW RAIDs at the Tier-2 I use the respective maximum
nowadays (1MiB on the newer controllers, 512KiB on some olders, 64KiB
on the ones which are still steam powered).

Only on some special nodes that run a DB I use something lower (IIRC
512KiB or 256KiB).

Best is probably, that once there actually is variable chunk size
support in btrfs, and the different RAID levels have stabilised and
been optimised, and RAID1 renamed to something better (SCNR, ;-P ),...
so probably in some 5-10 years... one makes some more extensive
benchmarks and then picks a reasonable default :-)


Cheers,
Chris.
David Sterba Jan. 5, 2016, 10:16 a.m. UTC | #36
On Thu, Dec 31, 2015 at 08:46:36AM +0800, Qu Wenruo wrote:
> > Let me note that a good reputation is also built from patch reviews
> > (hint hint).
> 
> I must admit I'm a bad reviewer.
> As when I review something, I always has an eager to rewrite part or all 
> the patch to follow my idea, even it's just a choice between different 
> design.

Yeah that's natural, but even if one does not completely agree, it's
still possible to verify that the implementation is correct.

The reviews also help to find and share some common style of
implementation so the maintainers don't scream when they see a patch and
developers and are not suprised that the patches take several rounds.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Jan. 5, 2016, 10:33 a.m. UTC | #37
On Wed, Dec 30, 2015 at 04:21:47PM -0500, Sanidhya Solanki wrote:
> On Wed, 30 Dec 2015 17:17:22 +0100
> David Sterba <dsterba@suse.cz> wrote:
> 
> > Let me note that a good reputation is also built from patch reviews
> > (hint hint).
> 
> Unfortunately, not too many patches coming in for BTRFS presently.
> Mailing list activity is down to 25-35 mails per day. Mostly feature
> and bug requests.
> 
> I will try to pitch in with patch reviews where possible.

It was not meant specifically to you, but I won't discourage you from
doing reviews of course. The period where a review is expected can vary
and is bound to the development cycle of kernel. At the latest, they
should come before the integration branch is put togheter (before the
merge window), and for the rc's it's before the next schedule (less than
a week).

The reviewed-by tag has a real meaning and weight in the community

http://lxr.free-electrons.com/source/Documentation/SubmittingPatches#L552

and besides that, subscribes the person to the blame game and can cause
bad feelings if the code turns out to be buggy later on.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba Jan. 5, 2016, 10:44 a.m. UTC | #38
On Sun, Jan 03, 2016 at 03:26:25AM +0100, Christoph Anton Mitterer wrote:
> On Sun, 2016-01-03 at 09:37 +0800, Qu Wenruo wrote:
> > And since you are making the stripe size configurable, then user is 
> > responsible for any too large or too small stripe size setting.
> That pops up the questions, which raid chunk sizes the kernel,
> respectively the userland tools should allow for btrfs...
> 
> I'd guess only powers of 2, some minimum, some maximum.

We have a full 32 bit number space, so multiples of power of 2 are also
possible if that makes sense. In general we don't need to set additional
limitations besides minimum, maximum and "minimal step".

> Are there any concerns/constraints with too small/too big chunks when
> these play together with lower block layers (I'd guess not).

I don't think so.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer Jan. 5, 2016, 6:48 p.m. UTC | #39
On Tue, 2016-01-05 at 11:44 +0100, David Sterba wrote:
> We have a full 32 bit number space, so multiples of power of 2 are
> also
> possible if that makes sense.
Hmm that would make a maximum of 4GiB RAID chunks...
perhaps we should reserve some of the higher bits for a multiplier, in
case 4GiB would ever become too little O;-)


>  In general we don't need to set additional
> limitations besides minimum, maximum and "minimal step".
And that can/should be done in the userland.


> > Are there any concerns/constraints with too small/too big chunks
> > when
> > these play together with lower block layers (I'd guess not).
> 
> I don't think so.

Well I was mainly thinking about dm-crypt, that uses 512B blocks and in
fact that size wouldn't be easy to change, as (IIRC) larger block sizes
make XTS less secure.
Obviously *this* isn't anything that btrfs would have to worry about,
especially as we're anyway on a higher block layer level,.. but it just
reminded me that there can be cases where too large / too small may
actually cause issues.


Cheers,
Chris.
Sanidhya Solanki Jan. 10, 2016, 3:11 a.m. UTC | #40
On Sun, 3 Jan 2016 09:37:04 +0800
Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> Btrfs checksum are calculated in 3 different method:
> 1) Metadata: Per nodesize, stored in tree blocker header. (struct 
> btrfs_header->csum)
> 2) Data: Per sectorsize, stored in csum tree.
> 3) Superblock: Per 4K (fixed), stored in its header (struct 
> btrfs_super->csum)
> 
> I didn't the need to change any of them, as you are not changing any
> of the csum behavior.

Good, that means I do not need to compare csums, just the data at the
end of the re-sizing operation. 

I have finished most of the actual implementation details and 
documentation for the re-size. However, before I undertake the last 
part of the development, the integration of kernel-space and user-space,
as well as the integration between my code and the kernel code. 

> Stripe size only affect how btrfs does IO, not the csum size.

I have a question regarding this statement. I wanted to confirm that
the re-size will result in all the data blocks being re-written. Is
that correct? 
Because the single statement above may also mean that all that needs to
be done is change one option and the kernel code takes care of the rest.

Just clarify that for me.
 
> And since you are making the stripe size configurable, then user is 
> responsible for any too large or too small stripe size setting.

I have also left the option to use an un-conventional stripe size to the
user, for e.g., 768, 3072, 6144, etc.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Jan. 11, 2016, 1:29 a.m. UTC | #41
Sanidhya Solanki wrote on 2016/01/09 22:11 -0500:
> On Sun, 3 Jan 2016 09:37:04 +0800
> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> Btrfs checksum are calculated in 3 different method:
>> 1) Metadata: Per nodesize, stored in tree blocker header. (struct
>> btrfs_header->csum)
>> 2) Data: Per sectorsize, stored in csum tree.
>> 3) Superblock: Per 4K (fixed), stored in its header (struct
>> btrfs_super->csum)
>>
>> I didn't the need to change any of them, as you are not changing any
>> of the csum behavior.
>
> Good, that means I do not need to compare csums, just the data at the
> end of the re-sizing operation.
>
> I have finished most of the actual implementation details and
> documentation for the re-size. However, before I undertake the last
> part of the development, the integration of kernel-space and user-space,
> as well as the integration between my code and the kernel code.
>
>> Stripe size only affect how btrfs does IO, not the csum size.
>
> I have a question regarding this statement. I wanted to confirm that
> the re-size will result in all the data blocks being re-written. Is
> that correct?

AFAIK, Yes for stripe based RAID profile.

And I must admit that, the above statement is not completely right.
It also affect the on-disk data layout. But csum is still not affected.

> Because the single statement above may also mean that all that needs to
> be done is change one option and the kernel code takes care of the rest.
>
> Just clarify that for me.
>
>> And since you are making the stripe size configurable, then user is
>> responsible for any too large or too small stripe size setting.
>
> I have also left the option to use an un-conventional stripe size to the
> user, for e.g., 768, 3072, 6144, etc.

Although personally I prefer to restrict to power of 2, but that's a 
personal choice anyway.

Thanks,
Qu

>
> Thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sanidhya Solanki Jan. 11, 2016, 11:49 a.m. UTC | #42
On Mon, 11 Jan 2016 16:43:35 +0100
Christoph Anton Mitterer <calestyo@scientia.net> wrote:

> On Sat, 2016-01-09 at 22:11 -0500, Sanidhya Solanki wrote:
> > I have also left the option to use an un-conventional stripe size to
> > the
> > user, for e.g., 768, 3072, 6144, etc.
> 
> Not sure whether it's such a good idea to allow this,... at least not
> without requiring some --force option.
> This seems to provoke all kinds of alignment issues,... also, but
> probably not limited to, for any loopdevices created on top of a
> btrfs.

I considered that, but, if you read the specification of modern HDDs
carefully, it states that they are 4K or 512e sector size compatible,
hence the concession. Also, older drive users may find this preferable.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer Jan. 11, 2016, 3:43 p.m. UTC | #43
On Sat, 2016-01-09 at 22:11 -0500, Sanidhya Solanki wrote:
> I have also left the option to use an un-conventional stripe size to
> the
> user, for e.g., 768, 3072, 6144, etc.

Not sure whether it's such a good idea to allow this,... at least not
without requiring some --force option.
This seems to provoke all kinds of alignment issues,... also, but
probably not limited to, for any loopdevices created on top of a btrfs.

Chris.
Christoph Anton Mitterer Jan. 11, 2016, 3:57 p.m. UTC | #44
On Mon, 2016-01-11 at 06:49 -0500, Sanidhya Solanki wrote:
> I considered that, but, if you read the specification of modern HDDs
> carefully, it states that they are 4K or 512e sector size compatible,
> hence the concession. Also, older drive users may find this
> preferable.
But these are powers of 2,... why does one need 768, 3072, 6144, etc.?
Hugo Mills Jan. 11, 2016, 4:01 p.m. UTC | #45
On Mon, Jan 11, 2016 at 04:57:48PM +0100, Christoph Anton Mitterer wrote:
> On Mon, 2016-01-11 at 06:49 -0500, Sanidhya Solanki wrote:
> > I considered that, but, if you read the specification of modern HDDs
> > carefully, it states that they are 4K or 512e sector size compatible,
> > hence the concession. Also, older drive users may find this
> > preferable.
> But these are powers of 2,... why does one need 768, 3072, 6144, etc.?

   If you're going to align with erase blocks on SSDs, some SSDs have
the erase block size as 3*2^n. (I don't know if that would be useful,
but it might be.)

   Hugo.
Sanidhya Solanki Jan. 12, 2016, 12:07 p.m. UTC | #46
On Tue, 12 Jan 2016 07:23:01 -0500
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> 768 I don't know, the other two though would provide proper 
> stripe-alignment on a 8 or 14 disk hardware RAID6 setup using 512
> byte sectors (that is, one stripe in BTRFS would span the full array).
> 
> I personally think it should be restricted to multiples of 512, as
> that is the smallest sector size on any disk drive that is
> realistically usable with BTRFS.  That would also provide 4k support,
> and beyond that 64k (which is where the really big drives are headed
> anyway), and also provide the ability to get proper stripe alignment
> on almost any hardware RAID configuration.

Alright, I guess that makes the implementation slightly easier. I will
implement it as only multiples of 512.

Thanks for you input everyone.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Austin S. Hemmelgarn Jan. 12, 2016, 12:23 p.m. UTC | #47
On 2016-01-11 10:57, Christoph Anton Mitterer wrote:
> On Mon, 2016-01-11 at 06:49 -0500, Sanidhya Solanki wrote:
>> I considered that, but, if you read the specification of modern HDDs
>> carefully, it states that they are 4K or 512e sector size compatible,
>> hence the concession. Also, older drive users may find this
>> preferable.
> But these are powers of 2,... why does one need 768, 3072, 6144, etc.?
768 I don't know, the other two though would provide proper 
stripe-alignment on a 8 or 14 disk hardware RAID6 setup using 512 byte 
sectors (that is, one stripe in BTRFS would span the full array).

I personally think it should be restricted to multiples of 512, as that 
is the smallest sector size on any disk drive that is realistically 
usable with BTRFS.  That would also provide 4k support, and beyond that 
64k (which is where the really big drives are headed anyway), and also 
provide the ability to get proper stripe alignment on almost any 
hardware RAID configuration.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 80e9c18..1454d21 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -28,6 +28,48 @@  config BTRFS_FS
 
 	  If unsure, say N.
 
+choice
+         prompt "Choose Stripe Size"
+         default RS_1024
+         help
+           Allows you to select the size of the stripe, which is the smallest sized
+           data block to be replicated.
+           Selecting a larger size than your physical block size may lead to data
+           fragmentation in some cases.
+
+        config RS_512
+                 bool "512 bytes"
+
+        config RS_1024
+                 bool "1024 bytes"
+
+        config RS_2048
+                 bool "2048 bytes"
+
+        config RS_4096
+                 bool "4096 bytes"
+
+        config RS_8192
+                 bool "8192 bytes"
+
+        config RS_16384
+                 bool "16384 bytes"
+
+        config RS_32768
+                 bool "32768 bytes"
+
+endchoice
+
+config BTRFS_RAID_STRIPE
+        int
+        default 512 if RS_512
+        default 1024 if RS_1024
+        default 2048 if RS_2048
+        default 4096 if RS_4096
+        default 8192 if RS_8192
+        default 16384 if RS_16384
+        default 32768 if RS_32768
+
 config BTRFS_FS_POSIX_ACL
 	bool "Btrfs POSIX Access Control Lists"
 	depends on BTRFS_FS
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 6d1d0b9..1c4e384 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -17,3 +17,5 @@  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \
 	tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o
+
+btrfs-$(CONFIG_BTRFS_RAID_STRIPE) += scrub.o super.o volumes.o
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b091d94..4d0f802 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -63,6 +63,10 @@  struct scrub_ctx;
  */
 #define SCRUB_MAX_PAGES_PER_BLOCK	16	/* 64k per node/leaf/sector */
 
+#define STRIPE_LENGTH CONFIG_BTRFS_RAID_STRIPE
+
+#define BTRFS_STRIPE_LEN	(64 * STRIPE_LENGTH)
+
 struct scrub_recover {
 	atomic_t		refs;
 	struct btrfs_bio	*bbio;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 24154e4..3d91f8d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -64,6 +64,10 @@ 
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
 
+#define STRIPE_LENGTH CONFIG_BTRFS_RAID_STRIPE
+
+#define BTRFS_STRIPE_LEN	(64 * STRIPE_LENGTH)
+
 static const struct super_operations btrfs_super_ops;
 static struct file_system_type btrfs_fs_type;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4564522..e1b2e5c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4461,7 +4461,7 @@  static int btrfs_cmp_device_info(const void *a, const void *b)
 static u32 find_raid56_stripe_len(u32 data_devices, u32 dev_stripe_target)
 {
 	/* TODO allow them to set a preferred stripe size */
-	return 64 * 1024;
+	return BTRFS_STRIPE_LEN;
 }
 
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index d5c84f6..9115a80 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -26,7 +26,9 @@ 
 
 extern struct mutex uuid_mutex;
 
-#define BTRFS_STRIPE_LEN	(64 * 1024)
+#define STRIPE_LENGTH CONFIG_BTRFS_RAID_STRIPE
+
+#define BTRFS_STRIPE_LEN	(64 * STRIPE_LENGTH)
 
 struct buffer_head;
 struct btrfs_pending_bios {