dm raid: ensure metadata IO matches device block size.

Message ID 20141015121907.265b3aed@notabene.brown (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer

Commit Message

NeilBrown Oct. 15, 2014, 1:19 a.m. UTC
dm_raid_superblock is 512 bytes.
Reading or writing this on a 512-byte sector device works fine.
On a 4096-byte sector device, this fails.

If we round up rdev->sb_size to match the block size of
the device, all IO will work correctly.

Reported-by: "Liuhua Wang" <lwang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>

---
This issue has already been discussed a bit. See the email thread
 Subject: Re: [dm-devel] [PATCH] fix mirror device creation with lvcreate failed
I think this is the best fix.  It handles both reads and writes, and (I think)
at the best level.

Thanks,
NeilBrown
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Mike Snitzer Oct. 15, 2014, 2:55 a.m. UTC | #1
On Tue, Oct 14 2014 at  9:19pm -0400,
NeilBrown <neilb@suse.de> wrote:

> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> index 4880b69e2e9e..31bdd73bc368 100644
> --- a/drivers/md/dm-raid.c
> +++ b/drivers/md/dm-raid.c
> @@ -858,7 +858,8 @@ static int super_load(struct md_rdev *rdev, struct md_rdev *refdev)
>  	uint64_t events_sb, events_refsb;
>  
>  	rdev->sb_start = 0;
> -	rdev->sb_size = sizeof(*sb);
> +	rdev->sb_size = roundup(sizeof(*sb),
> +				bdev_logical_block_size(rdev->meta_bdev));
>  
>  	ret = read_disk_sb(rdev, rdev->sb_size);
>  	if (ret)

Wouldn't it be better to use bdev_physical_block_size()?

Even on a 4K device that emulates 512b logical sectors it is better to
use the physical block size (4K).

NeilBrown Oct. 15, 2014, 3:40 a.m. UTC | #2
On Tue, 14 Oct 2014 22:55:50 -0400 Mike Snitzer <snitzer@redhat.com> wrote:

> Wouldn't it be better to use bdev_physical_block_size()?
> 
> Even on a 4K device that emulates 512b logical sectors it is better to
> use the physical block size (4K).


_logical_ is the smallest value for which the IO actually works.
And the goal of the change is to make it work.

I don't object to using _physical_, but it isn't clear to me how I would
justify that as "correct".

A big question in my mind is: how much space does LVM reserve in this device
for the metadata?  It seems reasonable to assume that it reserves at least
1 logical block.  If the API guarantees that at least one physical block is
reserved, then that would justify using _physical_.

A quick look at the code shows that the bitmap superblock is placed 4K after
the start of the metadata.
So the code should probably fail if the rounded-up sb_size exceeds 4K.
Mind you, that would exceed PAGE_SIZE too which would cause other problems.

Maybe use _physical_ unless that exceeds 4K, then try _logical_, then fail if
even that > 4K ??

Thanks,
NeilBrown
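
As a rough illustration of the fallback floated in the message above (physical unless that exceeds 4K, then logical, then fail), here is a minimal userspace sketch in C. It is not kernel code and was not posted in the thread. The helper names `round_up_len()` and `pick_sb_size()` are hypothetical, the 512-byte superblock size comes from the commit message, and returning 0 to signal failure is an assumption made only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define SB_BYTES      512u   /* superblock size stated in the commit message */
#define BITMAP_OFFSET 4096u  /* bitmap superblock offset mentioned in the thread */

/* Smallest multiple of blk that is >= len (hypothetical helper). */
size_t round_up_len(size_t len, size_t blk)
{
	assert(blk != 0);
	return ((len + blk - 1) / blk) * blk;
}

/*
 * Prefer the physical block size, fall back to the logical one, and
 * return 0 if neither rounded size fits in front of the bitmap.
 */
size_t pick_sb_size(size_t logical, size_t physical)
{
	size_t by_physical = round_up_len(SB_BYTES, physical);
	size_t by_logical = round_up_len(SB_BYTES, logical);

	if (by_physical <= BITMAP_OFFSET)
		return by_physical;
	if (by_logical <= BITMAP_OFFSET)
		return by_logical;
	return 0; /* a real caller would fail the superblock load here */
}
```

Under these assumptions, a 512e device (512-byte logical, 4096-byte physical) would get a 4096-byte superblock I/O, while a hypothetical device with 8192-byte physical and 512-byte logical blocks would fall back to 512 bytes.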
Mike Snitzer Oct. 15, 2014, 1:13 p.m. UTC | #3
On Tue, Oct 14 2014 at 11:40pm -0400,
NeilBrown <neilb@suse.de> wrote:

> _logical_ is the smallest value for which the IO actually works.
> And the goal of the change is to make it work.
> 
> I don't object to using _physical_, but it isn't clear to me how I would
> justify that as "correct".
> 
> A big question in my mind is: how much space does LVM reserve in this device
> for the metadata?  It seems reasonable to assume that it reserves at least
> 1 logical block.  If the API guarantees that at least one physical block is
> reserved, then that would justify using _physical_.

I'll have to check with Jon and/or Heinz on this point.

> A quick look at the code shows that the bitmap superblock is placed 4K after
> the start of the metadata.

"the code" being the MD kernel code right?  Any reason not to export a
#define that reflects the space MD reserves and just have dm-raid use that?

Starting to feel like hardcoding 4K is the right thing to do given the
current code.

NeilBrown Oct. 15, 2014, 9 p.m. UTC | #4
On Wed, 15 Oct 2014 09:13:08 -0400 Mike Snitzer <snitzer@redhat.com> wrote:

> > A big question in my mind is: how much space does LVM reserve in this device
> > for the metadata?  It seems reasonable to assume that it reserves at least
> > 1 logical block.  If the API guarantees that at least one physical block is
> > reserved, then that would justify using _physical_.
> 
> I'll have to check with Jon and/or Heinz on this point.
> 
> > A quick look at the code shows that the bitmap superblock is placed 4K after
> > the start of the metadata.
> 
> "the code" being the MD kernel code right?  Any reason not to export a
> #define that reflects the space MD reserves and just have dm-raid use that?

No, "the code" being

	mddev->bitmap_info.offset = 4096 >> 9; /* Enable bitmap creation */
	rdev->mddev->bitmap_info.default_offset = 4096 >> 9;

in super_validate in dm-raid.c.
i.e. dm-raid specific code.
md doesn't reserve space, it just uses what it is told to.  Told either by
mdadm via the md superblock or by dm-raid.

NeilBrown
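
The `4096 >> 9` in the two assignments quoted above converts a byte offset into 512-byte sectors. The following userspace sketch only spells out that arithmetic; it assumes the offset fields are expressed in 512-byte sectors, and the helper names and the local `SECTOR_SHIFT` definition are hypothetical stand-ins, not the kernel's own.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 1 << 9 == 512 bytes per sector */

/* Convert a byte offset to 512-byte sectors; the offset must be sector aligned. */
uint64_t bytes_to_sectors(uint64_t bytes)
{
	assert(bytes % (1u << SECTOR_SHIFT) == 0);
	return bytes >> SECTOR_SHIFT;
}

/* Convert a sector count back to bytes. */
uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors << SECTOR_SHIFT;
}
```

With these helpers, `bytes_to_sectors(4096)` evaluates to 8, so the bitmap starts 8 sectors (4096 bytes) into the metadata device.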

Heinz Mauelshagen Oct. 16, 2014, 1:31 p.m. UTC | #5
On 10/15/2014 11:00 PM, NeilBrown wrote:
>>> _logical_ is the smallest value for which the IO actually works.
>>> And the goal of the change is to make it work.
>>>
>>> I don't object to using _physical_, but it isn't clear to me how I would
>>> justify that as "correct".
>>>
>>> A big question in my mind is: how much space does LVM reserve in this device
>>> for the metadata?  It seems reasonable to assume that it reserves at least
>>> 1 logical block.  If the API guarantees that at least one physical block is
>>> reserved, then that would justify using _physical_.

dm-raid uses 512 bytes for its superblock, including padding, in the current
code (with far less payload), followed by the bitmap at offset 4096 bytes.

With Neil's proposal, the padding will be avoided altogether.

So logical looks fine to me, because physical may well be more than 4K in the
future. Admittedly, that is a long way off.

If that happens, we'd end up with the bitmap starting at offset 4096 bytes in
the same physical sector, and would have to cope with that plus the issues
involved with 4K page sizes.

To keep this from going unnoticed in dm/md later, we should have an
"if (rdev->sb_size > PAGE_SIZE) return -EINVAL;" after the setting of
rdev->sb_size in dm-raid.c, and appropriate checks in md.c.

BTW: even with the additions to the dm-raid superblock that I have to make in
order to allow for reshaping etc., its payload is going to stay < 512 bytes.

Heinz
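
The guard suggested in the message above can be sketched in userspace as follows. This is an illustration only, not the revised patch: `checked_sb_size()` is a hypothetical function, `ASSUMED_PAGE_SIZE` stands in for the kernel's `PAGE_SIZE` and is fixed at 4096 here as an assumption, and the 512-byte superblock size is taken from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SB_BYTES          512u  /* superblock size stated in the commit message */
#define ASSUMED_PAGE_SIZE 4096u /* assumption: stand-in for PAGE_SIZE */

/*
 * Round the superblock size up to the logical block size, then reject
 * the result if it no longer fits in one page.
 * Returns the rounded size in bytes, or -EINVAL.
 */
long checked_sb_size(size_t logical_block_size)
{
	size_t rounded;

	assert(logical_block_size != 0);
	rounded = ((SB_BYTES + logical_block_size - 1) / logical_block_size)
		  * logical_block_size;
	if (rounded > ASSUMED_PAGE_SIZE)
		return -EINVAL;
	return (long)rounded;
}
```

Under these assumptions, 512-byte and 4096-byte logical block sizes pass, and a hypothetical 8192-byte logical block size is rejected instead of silently overlapping the bitmap.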

Mike Snitzer Oct. 16, 2014, 7:56 p.m. UTC | #6
Heinz,

If you could propose a revised patch I'd appreciate it.

Thanks,
Mike

Patch

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 4880b69e2e9e..31bdd73bc368 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -858,7 +858,8 @@  static int super_load(struct md_rdev *rdev, struct md_rdev *refdev)
 	uint64_t events_sb, events_refsb;
 
 	rdev->sb_start = 0;
-	rdev->sb_size = sizeof(*sb);
+	rdev->sb_size = roundup(sizeof(*sb),
+				bdev_logical_block_size(rdev->meta_bdev));
 
 	ret = read_disk_sb(rdev, rdev->sb_size);
 	if (ret)
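
For readers who want to check the arithmetic behind the patch, here is a small userspace sketch. It is not kernel code: `round_up_to_block()` is a hypothetical stand-in for the kernel's `roundup()` macro, `io_len_is_valid()` only models the rule that a block I/O must cover a whole number of logical blocks, and the 512-byte superblock size is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define SB_BYTES 512u /* superblock size stated in the commit message */

/* Smallest multiple of block_size that is >= len. */
size_t round_up_to_block(size_t len, size_t block_size)
{
	assert(block_size != 0);
	return ((len + block_size - 1) / block_size) * block_size;
}

/* Model: an I/O length is acceptable only if it is a whole number of logical blocks. */
int io_len_is_valid(size_t len, size_t logical_block_size)
{
	assert(logical_block_size != 0);
	return len != 0 && len % logical_block_size == 0;
}
```

In this model a 512-byte I/O is rejected on a device with 4096-byte logical blocks, while the rounded-up length of 4096 bytes is accepted; on a 512-byte device the size stays at 512.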