mbox series

[v9,0/3] readmirror feature (read_policy sysfs and in-memory only approach)

Message ID cover.1603347462.git.anand.jain@oracle.com (mailing list archive)
Headers show
Series readmirror feature (read_policy sysfs and in-memory only approach) | expand

Message

Anand Jain Oct. 22, 2020, 7:43 a.m. UTC
v9: C coding style fixes in 1/3 and 3/3

v8:
Separate the sysfs framework and the %pid read_policy into a separate
patchset here, so that the new read policies can be in its own patch set.

A latency based read_policy is being prepared to send it in a separate
patchset as it depends on a few changes in the block layer as well.

__ Original email: __

v7:
Fix switch's fall through warning. Changle logs updates where necessary.

v6:
Patch 4/5 - If there is no change in device's read prefer then don't log
Patch 4/5 - Add pid to the logs
Patch 5/5 - If there isn't read preferred device in the chunk don't reset
read policy to default, instead just use stripe 0. As this is in
the read path it avoids going through the device list to find
read preferred device. So inline to this drop to check if there
is read preferred device before setting read policy to device.

v5:
Worked on review comments as received in its previous version.
Please refer to individual patches for the specific changes.
Introduces the new read_policy 'device'.

v4:
Rename readmirror attribute to read_policy. Drop separate kobj for
readmirror instead create read_policy attribute in fsid kobj.
merge v2:2/3 and v2:3/3 into v4:2/2. Patch titles have changed.
 
v3:
v2:
Mainly fixes the fs_devices::readmirror declaration type from atomic_t
to u8. (Thanks Josef).

v1:
As of now we use only %pid method to read stripped mirrored data. So
application's process id determines the stripe id to be read. This type
of routing typically helps in a system with many small independent
applications tying to read random data. On the other hand the %pid
based read IO distribution policy is inefficient if there is a single
application trying to read large data as because the overall disk
bandwidth would remains under utilized.

One type of readmirror policy isn't good enough and other choices are
routing the IO based on device's waitqueue or manual when we have a
read-preferred device or a readmirror policy based on the target storage
caching. So this patch-set introduces a framework where we could add more
readmirror policies.

This policy is a filesystem wide policy as of now, and though the
readmirror policy at the subvolume level is a novel approach as it
provides maximum flexibility in the data center, but as of now its not
practical to implement such a granularity as you can't really ensure
reflinked extents will be read from the stripe of its desire and so
there will be more limitations and it can be assessed separately.

The approach in this patch-set is sys interface with in-memory policy.
And does not add any new readmirror type in this set, which can be add
once we are ok with the framework. Also the default policy remains %pid.

Previous works:
----------------------------------------------------------------------
There were few RFCs [1] before, mainly to figure out storage
(or in memory only) for the readmirror policy and the interface needed.

[1]
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg86368.html

https://lore.kernel.org/linux-btrfs/20190826090438.7044-1-anand.jain@oracle.com/

https://lore.kernel.org/linux-btrfs/5fcf9c23-89b5-b167-1f80-a0f4ac107d0b@oracle.com/

https://patchwork.kernel.org/cover/10859213/

Mount -o:
In the first trial it was attempted to use the mount -o option to carry
the readmirror policy, this is good for debugging which can make sure
even the mount thread metadata tree blocks are read from the disk desired.
It was very effective in testing radi1/raid10 write-holes.

Extended attribute:
As extended attribute is associated with the inode, to implement this
there is bit of extended attribute abuse or else makes it mandatory to
mount the rootid 5. Its messy unless readmirror policy is applied at the
subvol level which is not possible as of now. 

An item type:
The proposed patch was to create an item to hold the readmirror policy,
it makes sense when compared to the abusive extended attribute approach
but introduces a new item and so no backward compatibility.
-----------------------------------------------------------------------

Anand Jain (3):
  btrfs: add btrfs_strmatch helper
  btrfs: create read policy framework
  btrfs: create read policy sysfs attribute, pid

 fs/btrfs/sysfs.c   | 69 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.c | 15 +++++++++-
 fs/btrfs/volumes.h | 14 ++++++++++
 3 files changed, 97 insertions(+), 1 deletion(-)

Comments

David Sterba Oct. 23, 2020, 5:04 p.m. UTC | #1
On Thu, Oct 22, 2020 at 03:43:34PM +0800, Anand Jain wrote:
> v9: C coding style fixes in 1/3 and 3/3

So the point of adding the sysfs knobs is to allow testing various
mirror selection strategies, what exactly was discussed in the past. Do
you have patches for that as well? It does not need to be final and
polished but at least give us something to test.
Josef Bacik Oct. 26, 2020, 2:01 p.m. UTC | #2
On 10/22/20 3:43 AM, Anand Jain wrote:
> v9: C coding style fixes in 1/3 and 3/3
> 

You can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

to this series, lets get this in so we can focus on the actual changes to the 
policy.  Thanks,

Josef
Anand Jain Oct. 27, 2020, 1:52 a.m. UTC | #3
On 24/10/20 1:04 am, David Sterba wrote:
> On Thu, Oct 22, 2020 at 03:43:34PM +0800, Anand Jain wrote:
>> v9: C coding style fixes in 1/3 and 3/3
> 
> So the point of adding the sysfs knobs is to allow testing various
> mirror selection strategies, what exactly was discussed in the past. Do
> you have patches for that as well? It does not need to be final and
> polished but at least give us something to test.
> 

Sure. I just sent out the patchset [1]. It provides read_policy: 
latency, device, and round-robin.

[1]
https://patchwork.kernel.org/project/linux-btrfs/list/?series=371023

Thanks, Anand
David Sterba Oct. 27, 2020, 6:02 p.m. UTC | #4
On Tue, Oct 27, 2020 at 09:52:11AM +0800, Anand Jain wrote:
> 
> 
> On 24/10/20 1:04 am, David Sterba wrote:
> > On Thu, Oct 22, 2020 at 03:43:34PM +0800, Anand Jain wrote:
> >> v9: C coding style fixes in 1/3 and 3/3
> > 
> > So the point of adding the sysfs knobs is to allow testing various
> > mirror selection strategies, what exactly was discussed in the past. Do
> > you have patches for that as well? It does not need to be final and
> > polished but at least give us something to test.
> > 
> 
> Sure. I just sent out the patchset [1]. It provides read_policy: 
> latency, device, and round-robin.
> 
> [1]
> https://patchwork.kernel.org/project/linux-btrfs/list/?series=371023

Exporting more information from the devices would help us to decide but
is there anything we can do just with what we have? Or eventually add
our counters like for the in-flight requests or total bytes. The
intention is not to duplicate what block layer does as we need to
experiment with the stats and heuristics it's just for that purpose and
we don't have to rely on other subsystem patches.
Anand Jain Oct. 28, 2020, 12:06 p.m. UTC | #5
On 28/10/20 2:02 am, David Sterba wrote:
> On Tue, Oct 27, 2020 at 09:52:11AM +0800, Anand Jain wrote:
>>
>>
>> On 24/10/20 1:04 am, David Sterba wrote:
>>> On Thu, Oct 22, 2020 at 03:43:34PM +0800, Anand Jain wrote:
>>>> v9: C coding style fixes in 1/3 and 3/3
>>>
>>> So the point of adding the sysfs knobs is to allow testing various
>>> mirror selection strategies, what exactly was discussed in the past. Do
>>> you have patches for that as well? It does not need to be final and
>>> polished but at least give us something to test.
>>>
>>
>> Sure. I just sent out the patchset [1]. It provides read_policy:
>> latency, device, and round-robin.
>>
>> [1]
>> https://patchwork.kernel.org/project/linux-btrfs/list/?series=371023
> 
> Exporting more information from the devices would help us to decide but
> is there anything we can do just with what we have? Or eventually add
> our counters like for the in-flight requests or total bytes. The
> intention is not to duplicate what block layer does as we need to
> experiment with the stats and heuristics it's just for that purpose and
> we don't have to rely on other subsystem patches.
> 

I could rewrote the latency patch without export of any new block layer 
functions. Patchset v1 will soon be sent to the ML. I hope that
shall address your concern.

Thanks, Anand