
[PATCHv5,00/14] dm-zoned: metadata version 2

Message ID 20200508090332.40716-1-hare@suse.de (mailing list archive)
Series dm-zoned: metadata version 2

Message

Hannes Reinecke May 8, 2020, 9:03 a.m. UTC
Hi all,

this patchset adds a new metadata version 2 for dm-zoned, which brings the
following improvements:

- UUIDs and labels: Adding three more fields to the metadata containing
  the dm-zoned device UUID and label, and the device UUID. This allows
  for a unique identification of the devices, so that several dm-zoned
  sets can coexist and have a persistent identification (a rough sketch of
  these fields follows below).
- Extend random zones by an additional regular disk device: A regular
  block device can be added together with the zoned block device, providing
  additional (emulated) random write zones. With this it's possible to
  handle sequential-zone-only devices; there will also be a speed-up if
  the regular block device resides on a fast medium. The regular block device
  is placed logically in front of the zoned block device, so that metadata
  and mapping tables reside on the regular block device, not the zoned device.
- Tertiary superblock support: In addition to the two existing sets of metadata
  another, tertiary, superblock is written to the first block of the zoned
  block device. This superblock is for identification only; the generation
  number is set to '0' and the block itself is never updated. Additional
  metadata like the bitmap tables etc. is not copied.
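
As an illustration of the new identification fields, here is a small sketch;
the field names and sizes are assumptions for illustration, not the
authoritative dm-zoned v2 on-disk layout:

#include <stdint.h>

/* Sketch only: assumed names and sizes, not the actual struct dmz_super. */
struct dmz_super_v2_sketch {
	uint32_t magic;         /* metadata magic number */
	uint32_t version;       /* 2 for this metadata version */
	uint64_t gen;           /* generation; 0 for the tertiary superblock */
	uint64_t sb_block;      /* global block number of this superblock,
	                         * relative to the device-mapper device */
	uint8_t  dmz_uuid[16];  /* UUID of the dm-zoned device (the set) */
	uint8_t  dmz_label[32]; /* human readable label of the set */
	uint8_t  dev_uuid[16];  /* UUID of this component device */
};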

To handle this, some changes to the original handling are introduced:
- Zones are now equidistant. Originally, runt zones were ignored, and
  not counted when sizing the mapping tables. With the dual device setup
  runt zones might occur at the end of the regular block device, making
  direct translation between zone number and sector/block number complex.
  For metadata version 2 all zones are considered to be of the same size,
  and runt zones are simply marked as 'offline' to have them ignored when
  allocating a new zone.
- The block number in the superblock is now the global number, and refers to
  the location of the superblock relative to the resulting device-mapper
  device. This means that the tertiary superblock contains absolute block
  addresses, which need to be translated to the relative device addresses
  to find the referenced block.
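
To illustrate that translation, here is a small self-contained sketch (not the
actual dm-zoned code; the layout and names are assumptions) of mapping a global
block number to a component device and a device-relative block, given that the
regular device sits logically in front of the zoned device:

#include <stdint.h>
#include <stdio.h>

/* Assumed layout: regular device at index 0, zoned device at index 1. */
struct layout_sketch {
	uint64_t reg_blocks;    /* blocks of the regular device, in whole zones */
	uint64_t zoned_blocks;  /* blocks of the zoned device */
};

/* Translate a global (DM-relative) block number, e.g. the sb_block value
 * stored in the tertiary superblock, into a device index and a block number
 * relative to that device. */
static int global_to_dev_block(const struct layout_sketch *l, uint64_t global,
			       int *dev_idx, uint64_t *dev_block)
{
	if (global < l->reg_blocks) {
		*dev_idx = 0;
		*dev_block = global;
		return 0;
	}
	if (global < l->reg_blocks + l->zoned_blocks) {
		*dev_idx = 1;
		*dev_block = global - l->reg_blocks;
		return 0;
	}
	return -1;      /* out of range */
}

int main(void)
{
	struct layout_sketch l = { .reg_blocks = 1 << 20, .zoned_blocks = 1 << 24 };
	int dev;
	uint64_t blk;

	/* The tertiary superblock lives in the first block of the zoned device,
	 * i.e. at global block l.reg_blocks. */
	if (!global_to_dev_block(&l, l.reg_blocks, &dev, &blk))
		printf("dev %d, block %llu\n", dev, (unsigned long long)blk);
	return 0;
}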

There is an accompanying patchset for dm-zoned-tools for writing and checking
this new metadata.

As usual, comments and reviews are welcome.

Changes to v4:
- Add reviews from Damien
- Silence logging output as suggested by Mike Snitzer
- Fixup compilation on 32bit archs

Changes to v3:
- Reorder devices such that the regular device is always at position 0,
  and the zoned device is always at position 1.
- Split off dmz_dev_is_dying() into a separate patch
- Include reviews from Damien

Changes to v2:
- Kill dmz_id()
- Include reviews from Damien
- Sanitize uuid handling as suggested by John Dorminy


Hannes Reinecke (14):
  dm-zoned: add 'status' and 'message' callbacks
  dm-zoned: store zone id within the zone structure and kill dmz_id()
  dm-zoned: use array for superblock zones
  dm-zoned: store device in struct dmz_sb
  dm-zoned: move fields from struct dmz_dev to dmz_metadata
  dm-zoned: introduce dmz_metadata_label() to format device name
  dm-zoned: Introduce dmz_dev_is_dying() and dmz_check_dev()
  dm-zoned: remove 'dev' argument from reclaim
  dm-zoned: replace 'target' pointer in the bio context
  dm-zoned: use dmz_zone_to_dev() when handling metadata I/O
  dm-zoned: add metadata logging functions
  dm-zoned: Reduce logging output on startup
  dm-zoned: ignore metadata zone in dmz_alloc_zone()
  dm-zoned: metadata version 2

 drivers/md/dm-zoned-metadata.c | 664 +++++++++++++++++++++++++++++++----------
 drivers/md/dm-zoned-reclaim.c  |  88 +++---
 drivers/md/dm-zoned-target.c   | 376 +++++++++++++++--------
 drivers/md/dm-zoned.h          |  35 ++-
 4 files changed, 825 insertions(+), 338 deletions(-)

Comments

Damien Le Moal May 11, 2020, 2:46 a.m. UTC | #1
On 2020/05/08 18:03, Hannes Reinecke wrote:
> Hi all,
> 
> this patchset adds a new metadata version 2 for dm-zoned, which brings the
> following improvements:
> 
> - UUIDs and labels: Adding three more fields to the metadata containing
>   the dm-zoned device UUID and label, and the device UUID. This allows
>   for an unique identification of the devices, so that several dm-zoned
>   sets can coexist and have a persistent identification.
> - Extend random zones by an additional regular disk device: A regular
>   block device can be added together with the zoned block device, providing
>   additional (emulated) random write zones. With this it's possible to
>   handle sequential zones only devices; also there will be a speed-up if
>   the regular block device resides on a fast medium. The regular block device
>   is placed logically in front of the zoned block device, so that metadata
>   and mapping tables reside on the regular block device, not the zoned device.
> - Tertiary superblock support: In addition to the two existing sets of metadata
>   another, tertiary, superblock is written to the first block of the zoned
>   block device. This superblock is for identification only; the generation
>   number is set to '0' and the block itself it never updated. The addition
>   metadate like bitmap tables etc are not copied.
> 
> To handle this, some changes to the original handling are introduced:
> - Zones are now equidistant. Originally, runt zones were ignored, and
>   not counted when sizing the mapping tables. With the dual device setup
>   runt zones might occur at the end of the regular block device, making
>   direct translation between zone number and sector/block number complex.
>   For metadata version 2 all zones are considered to be of the same size,
>   and runt zones are simply marked as 'offline' to have them ignored when
>   allocating a new zone.
> - The block number in the superblock is now the global number, and refers to
>   the location of the superblock relative to the resulting device-mapper
>   device. Which means that the tertiary superblock contains absolute block
>   addresses, which needs to be translated to the relative device addresses
>   to find the referenced block.
> 
> There is an accompanying patchset for dm-zoned-tools for writing and checking
> this new metadata.
> 
> As usual, comments and reviews are welcome.

I gave this series a good round of testing. See the attached picture for the
results. The test is this:
1) Setup dm-zoned
2) Format and mount with mkfs.ext4 -E packed_meta_blocks=1 /dev/mapper/xxx
3) Create files random in size between 1 and 4MB and measure the user-seen
throughput over 100 files.
4) Run that for 2 hours

I ran this over a 15TB SMR drive single drive setup, and on the same drive + a
500GB m.2 ssd added.

For the single drive case, the usual 3 phases can be seen: start writing at
about 110MB/s, everything going to conventional zones (note conv zones are in
the middle of the disk, hence the low-ish throughput). Then after about 400s,
reclaim kicks in and the throughput drops to 60-70 MB/s. As reclaim cannot keep
up under this heavy write workload, performance drops to 20-30MB/s after 800s.
All good, without any idle time for reclaim to do its job, this is all expected.

For the dual drive case, things are more interesting:
1) The first phase is longer as, overall, there is more conventional space (500G
ssd + 400G on SMR drive). So we see the SSD speed first (~425MB/s), then the
drive speed (100 MB/s), slightly lower than the single drive case toward the end
as reclaim triggers.
2) Some recovery back to ssd speed, then a long phase at half the speed of the
ssd as writes go to ssd and reclaim is running, moving data out of the ssd onto
the disk.
3) Then a long phase at 25MB/s due to SMR disk reclaim.
4) Back up to half the ssd speed.

No crashes, no data corruption, all good. But it does look like we can improve
performance further by not using the drive's conventional zones as
"buffer" zones. If we let those be the final resting place of data, the
SMR-disk-only reclaim would not kick in and hurt performance as seen here. That
I think can all be done on top of this series though. Let's get this in first.

Mike,

I am still seeing the warning:

[ 1827.839756] device-mapper: table: 253:1: adding target device sdj caused an
alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
alignment_offset=0, start=0
[ 1827.856738] device-mapper: table: 253:1: adding target device sdj caused an
alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
alignment_offset=0, start=0
[ 1827.874031] device-mapper: table: 253:1: adding target device sdj caused an
alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
alignment_offset=0, start=0
[ 1827.891086] device-mapper: table: 253:1: adding target device sdj caused an
alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
alignment_offset=0, start=0

when mixing 512B sector and 4KB sector devices. Investigating now.

Hannes,

I pushed some minor updates to dmzadm staging branch on top of your changes.

> 
> Changes to v4:
> - Add reviews from Damien
> - Silence logging output as suggested by Mike Snitzer
> - Fixup compilation on 32bit archs
> 
> Changes to v3:
> - Reorder devices such that the regular device is always at position 0,
>   and the zoned device is always at position 1.
> - Split off dmz_dev_is_dying() into a separate patch
> - Include reviews from Damien
> 
> Changes to v2:
> - Kill dmz_id()
> - Include reviews from Damien
> - Sanitize uuid handling as suggested by John Dorminy
> 
> 
> Hannes Reinecke (14):
>   dm-zoned: add 'status' and 'message' callbacks
>   dm-zoned: store zone id within the zone structure and kill dmz_id()
>   dm-zoned: use array for superblock zones
>   dm-zoned: store device in struct dmz_sb
>   dm-zoned: move fields from struct dmz_dev to dmz_metadata
>   dm-zoned: introduce dmz_metadata_label() to format device name
>   dm-zoned: Introduce dmz_dev_is_dying() and dmz_check_dev()
>   dm-zoned: remove 'dev' argument from reclaim
>   dm-zoned: replace 'target' pointer in the bio context
>   dm-zoned: use dmz_zone_to_dev() when handling metadata I/O
>   dm-zoned: add metadata logging functions
>   dm-zoned: Reduce logging output on startup
>   dm-zoned: ignore metadata zone in dmz_alloc_zone()
>   dm-zoned: metadata version 2
> 
>  drivers/md/dm-zoned-metadata.c | 664 +++++++++++++++++++++++++++++++----------
>  drivers/md/dm-zoned-reclaim.c  |  88 +++---
>  drivers/md/dm-zoned-target.c   | 376 +++++++++++++++--------
>  drivers/md/dm-zoned.h          |  35 ++-
>  4 files changed, 825 insertions(+), 338 deletions(-)
>
Hannes Reinecke May 11, 2020, 6:31 a.m. UTC | #2
On 5/11/20 4:46 AM, Damien Le Moal wrote:
> On 2020/05/08 18:03, Hannes Reinecke wrote:
>> Hi all,
>>
>> this patchset adds a new metadata version 2 for dm-zoned, which brings the
>> following improvements:
>>
>> - UUIDs and labels: Adding three more fields to the metadata containing
>>    the dm-zoned device UUID and label, and the device UUID. This allows
>>    for an unique identification of the devices, so that several dm-zoned
>>    sets can coexist and have a persistent identification.
>> - Extend random zones by an additional regular disk device: A regular
>>    block device can be added together with the zoned block device, providing
>>    additional (emulated) random write zones. With this it's possible to
>>    handle sequential zones only devices; also there will be a speed-up if
>>    the regular block device resides on a fast medium. The regular block device
>>    is placed logically in front of the zoned block device, so that metadata
>>    and mapping tables reside on the regular block device, not the zoned device.
>> - Tertiary superblock support: In addition to the two existing sets of metadata
>>    another, tertiary, superblock is written to the first block of the zoned
>>    block device. This superblock is for identification only; the generation
>>    number is set to '0' and the block itself it never updated. The addition
>>    metadate like bitmap tables etc are not copied.
>>
>> To handle this, some changes to the original handling are introduced:
>> - Zones are now equidistant. Originally, runt zones were ignored, and
>>    not counted when sizing the mapping tables. With the dual device setup
>>    runt zones might occur at the end of the regular block device, making
>>    direct translation between zone number and sector/block number complex.
>>    For metadata version 2 all zones are considered to be of the same size,
>>    and runt zones are simply marked as 'offline' to have them ignored when
>>    allocating a new zone.
>> - The block number in the superblock is now the global number, and refers to
>>    the location of the superblock relative to the resulting device-mapper
>>    device. Which means that the tertiary superblock contains absolute block
>>    addresses, which needs to be translated to the relative device addresses
>>    to find the referenced block.
>>
>> There is an accompanying patchset for dm-zoned-tools for writing and checking
>> this new metadata.
>>
>> As usual, comments and reviews are welcome.
> 
> I gave this series a good round of testing. See the attached picture for the
> results. The test is this:
> 1) Setup dm-zoned
> 2) Format and mount with mkfs.ext4 -E packed_meta_blocks=1 /dev/mapper/xxx
> 3) Create file random in size between 1 and 4MB and measure user seen throughput
> over 100 files.
> 3) Run that for 2 hours
> 
> I ran this over a 15TB SMR drive single drive setup, and on the same drive + a
> 500GB m.2 ssd added.
> 
> For the single drive case, the usual 3 phases can be seen: start writing at
> about 110MB/s, everything going to conventional zones (note conv zones are in
> the middle of the disk, hence the low-ish throughput). Then after about 400s,
> reclaim kicks in and the throughput drops to 60-70 MB/s. As reclaim cannot keep
> up under this heavy write workload, performance drops to 20-30MB/s after 800s.
> All good, without any idle time for reclaim to do its job, this is all expected.
> 
> For the dual drive case, things are more interesting:
> 1) The first phase is longer as overall, there is more conventional space (500G
> ssd + 400G on SMR drive). So we see the SSD speed first (~425MB/s), then the
> drive speed (100 MB/s), slightly lower than the single drive case toward the end
> as reclaim triggers.
> 2) Some recovery back to ssd speed, then a long phase at half the speed of the
> ssd as writes go to ssd and reclaim is running moving data out of the ssd onto
> the disk.
> 3) Then a long phase at 25MB/s due to SMR disk reclaim.
> 4) back up to half the ssd speed.
> 
> No crashes, no data corruption, all good. But is does look like we can improve
> on performance further by preventing using the drive conventional zones as
> "buffer" zones. If we let those be the final resting place of data, the SMR disk
> only reclaim would not kick in and hurt performance as seen here. That I think
> can all be done on top of this series though. Let's get this in first.
> 
Thanks for the data! That indeed is very interesting; guess I'll do some 
tests here on my setup, too.
(And hope it doesn't burn my NVDIMM ...)

But, guess what, I had the same thoughts; we should be treating the 
random zones more like sequential zones in a two-disk setup.
So guess I'll be resurrecting the idea from my very first patch and 
implement 'cache' zones in addition to the existing 'random' and 
'sequential' zones.
But, as you said, that'll be a next series of patches.

What program did you use as a load generator?

Cheers,

Hannes
Damien Le Moal May 11, 2020, 6:41 a.m. UTC | #3
On 2020/05/11 15:31, Hannes Reinecke wrote:
> On 5/11/20 4:46 AM, Damien Le Moal wrote:
>> On 2020/05/08 18:03, Hannes Reinecke wrote:
>>> Hi all,
>>>
>>> this patchset adds a new metadata version 2 for dm-zoned, which brings the
>>> following improvements:
>>>
>>> - UUIDs and labels: Adding three more fields to the metadata containing
>>>    the dm-zoned device UUID and label, and the device UUID. This allows
>>>    for an unique identification of the devices, so that several dm-zoned
>>>    sets can coexist and have a persistent identification.
>>> - Extend random zones by an additional regular disk device: A regular
>>>    block device can be added together with the zoned block device, providing
>>>    additional (emulated) random write zones. With this it's possible to
>>>    handle sequential zones only devices; also there will be a speed-up if
>>>    the regular block device resides on a fast medium. The regular block device
>>>    is placed logically in front of the zoned block device, so that metadata
>>>    and mapping tables reside on the regular block device, not the zoned device.
>>> - Tertiary superblock support: In addition to the two existing sets of metadata
>>>    another, tertiary, superblock is written to the first block of the zoned
>>>    block device. This superblock is for identification only; the generation
>>>    number is set to '0' and the block itself it never updated. The addition
>>>    metadate like bitmap tables etc are not copied.
>>>
>>> To handle this, some changes to the original handling are introduced:
>>> - Zones are now equidistant. Originally, runt zones were ignored, and
>>>    not counted when sizing the mapping tables. With the dual device setup
>>>    runt zones might occur at the end of the regular block device, making
>>>    direct translation between zone number and sector/block number complex.
>>>    For metadata version 2 all zones are considered to be of the same size,
>>>    and runt zones are simply marked as 'offline' to have them ignored when
>>>    allocating a new zone.
>>> - The block number in the superblock is now the global number, and refers to
>>>    the location of the superblock relative to the resulting device-mapper
>>>    device. Which means that the tertiary superblock contains absolute block
>>>    addresses, which needs to be translated to the relative device addresses
>>>    to find the referenced block.
>>>
>>> There is an accompanying patchset for dm-zoned-tools for writing and checking
>>> this new metadata.
>>>
>>> As usual, comments and reviews are welcome.
>>
>> I gave this series a good round of testing. See the attached picture for the
>> results. The test is this:
>> 1) Setup dm-zoned
>> 2) Format and mount with mkfs.ext4 -E packed_meta_blocks=1 /dev/mapper/xxx
>> 3) Create file random in size between 1 and 4MB and measure user seen throughput
>> over 100 files.
>> 3) Run that for 2 hours
>>
>> I ran this over a 15TB SMR drive single drive setup, and on the same drive + a
>> 500GB m.2 ssd added.
>>
>> For the single drive case, the usual 3 phases can be seen: start writing at
>> about 110MB/s, everything going to conventional zones (note conv zones are in
>> the middle of the disk, hence the low-ish throughput). Then after about 400s,
>> reclaim kicks in and the throughput drops to 60-70 MB/s. As reclaim cannot keep
>> up under this heavy write workload, performance drops to 20-30MB/s after 800s.
>> All good, without any idle time for reclaim to do its job, this is all expected.
>>
>> For the dual drive case, things are more interesting:
>> 1) The first phase is longer as overall, there is more conventional space (500G
>> ssd + 400G on SMR drive). So we see the SSD speed first (~425MB/s), then the
>> drive speed (100 MB/s), slightly lower than the single drive case toward the end
>> as reclaim triggers.
>> 2) Some recovery back to ssd speed, then a long phase at half the speed of the
>> ssd as writes go to ssd and reclaim is running moving data out of the ssd onto
>> the disk.
>> 3) Then a long phase at 25MB/s due to SMR disk reclaim.
>> 4) back up to half the ssd speed.
>>
>> No crashes, no data corruption, all good. But is does look like we can improve
>> on performance further by preventing using the drive conventional zones as
>> "buffer" zones. If we let those be the final resting place of data, the SMR disk
>> only reclaim would not kick in and hurt performance as seen here. That I think
>> can all be done on top of this series though. Let's get this in first.
>>
> Thanks for the data! That indeed is very interesting; guess I'll do some 
> tests here on my setup, too.
> (And hope it doesn't burn my NVDIMM ...)
> 
> But, guess what, I had the some thoughts; we should be treating the 
> random zones more like sequential zones in a two-disk setup.
> So guess I'll be resurrecting the idea from my very first patch and 
> implement 'cache' zones in addition to the existing 'random' and 
> 'sequential' zones.
> But, as you said, that'll be a next series of patches.
> 
> What program did you use as a load generator?

My own "filebench" program :)
It is just a simple tool that writes files using a given number of threads, IO
size, IO type, sync or not, fsync or not, etc. Various parameters are possible.
The same can be done with fio.

The parameters I used are:
- random (uniform distribution) file size between 1MiB and 4 MiB
- Files written using buffered IOs with 4K writes (I know ! just trying to be
nasty to dm-zoned)
- 20 writer threads each writing 10 files: each data point is the bandwidth of
those 20 threads/200 files being written (above I said 100 files... My bad. It
is 200).
- For each data point, create a new directory to not end up with super large
directories slowing down the FS
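
For reference, a rough, single-threaded C approximation of that file writing
pattern (random 1-4 MiB files written with buffered 4K writes); this is only a
sketch of the described workload, not the actual filebench tool:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NR_FILES  200              /* one data point = 200 files */
#define IO_SIZE   4096             /* 4K buffered writes */
#define MIN_SIZE  (1UL << 20)      /* 1 MiB */
#define MAX_SIZE  (4UL << 20)      /* 4 MiB */

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	char path[4096], buf[IO_SIZE];
	int i;

	memset(buf, 0x5a, sizeof(buf));
	srand(1234);

	for (i = 0; i < NR_FILES; i++) {
		/* Uniformly distributed file size between 1 and 4 MiB */
		size_t size = MIN_SIZE + (size_t)rand() % (MAX_SIZE - MIN_SIZE + 1);
		size_t done = 0;
		int fd;

		snprintf(path, sizeof(path), "%s/file-%d", dir, i);
		fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		while (done < size) {
			size_t chunk = size - done < IO_SIZE ? size - done : IO_SIZE;

			if (write(fd, buf, chunk) != (ssize_t)chunk) {
				perror("write");
				close(fd);
				return 1;
			}
			done += chunk;
		}
		/* fsync(fd); optional, matching the sync/fsync test variants */
		close(fd);
	}
	return 0;
}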
Damien Le Moal May 11, 2020, 6:55 a.m. UTC | #4
On 2020/05/11 15:31, Hannes Reinecke wrote:
> On 5/11/20 4:46 AM, Damien Le Moal wrote:
>> On 2020/05/08 18:03, Hannes Reinecke wrote:
>>> Hi all,
>>>
>>> this patchset adds a new metadata version 2 for dm-zoned, which brings the
>>> following improvements:
>>>
>>> - UUIDs and labels: Adding three more fields to the metadata containing
>>>    the dm-zoned device UUID and label, and the device UUID. This allows
>>>    for an unique identification of the devices, so that several dm-zoned
>>>    sets can coexist and have a persistent identification.
>>> - Extend random zones by an additional regular disk device: A regular
>>>    block device can be added together with the zoned block device, providing
>>>    additional (emulated) random write zones. With this it's possible to
>>>    handle sequential zones only devices; also there will be a speed-up if
>>>    the regular block device resides on a fast medium. The regular block device
>>>    is placed logically in front of the zoned block device, so that metadata
>>>    and mapping tables reside on the regular block device, not the zoned device.
>>> - Tertiary superblock support: In addition to the two existing sets of metadata
>>>    another, tertiary, superblock is written to the first block of the zoned
>>>    block device. This superblock is for identification only; the generation
>>>    number is set to '0' and the block itself it never updated. The addition
>>>    metadate like bitmap tables etc are not copied.
>>>
>>> To handle this, some changes to the original handling are introduced:
>>> - Zones are now equidistant. Originally, runt zones were ignored, and
>>>    not counted when sizing the mapping tables. With the dual device setup
>>>    runt zones might occur at the end of the regular block device, making
>>>    direct translation between zone number and sector/block number complex.
>>>    For metadata version 2 all zones are considered to be of the same size,
>>>    and runt zones are simply marked as 'offline' to have them ignored when
>>>    allocating a new zone.
>>> - The block number in the superblock is now the global number, and refers to
>>>    the location of the superblock relative to the resulting device-mapper
>>>    device. Which means that the tertiary superblock contains absolute block
>>>    addresses, which needs to be translated to the relative device addresses
>>>    to find the referenced block.
>>>
>>> There is an accompanying patchset for dm-zoned-tools for writing and checking
>>> this new metadata.
>>>
>>> As usual, comments and reviews are welcome.
>>
>> I gave this series a good round of testing. See the attached picture for the
>> results. The test is this:
>> 1) Setup dm-zoned
>> 2) Format and mount with mkfs.ext4 -E packed_meta_blocks=1 /dev/mapper/xxx
>> 3) Create file random in size between 1 and 4MB and measure user seen throughput
>> over 100 files.
>> 3) Run that for 2 hours
>>
>> I ran this over a 15TB SMR drive single drive setup, and on the same drive + a
>> 500GB m.2 ssd added.
>>
>> For the single drive case, the usual 3 phases can be seen: start writing at
>> about 110MB/s, everything going to conventional zones (note conv zones are in
>> the middle of the disk, hence the low-ish throughput). Then after about 400s,
>> reclaim kicks in and the throughput drops to 60-70 MB/s. As reclaim cannot keep
>> up under this heavy write workload, performance drops to 20-30MB/s after 800s.
>> All good, without any idle time for reclaim to do its job, this is all expected.
>>
>> For the dual drive case, things are more interesting:
>> 1) The first phase is longer as overall, there is more conventional space (500G
>> ssd + 400G on SMR drive). So we see the SSD speed first (~425MB/s), then the
>> drive speed (100 MB/s), slightly lower than the single drive case toward the end
>> as reclaim triggers.
>> 2) Some recovery back to ssd speed, then a long phase at half the speed of the
>> ssd as writes go to ssd and reclaim is running moving data out of the ssd onto
>> the disk.
>> 3) Then a long phase at 25MB/s due to SMR disk reclaim.
>> 4) back up to half the ssd speed.
>>
>> No crashes, no data corruption, all good. But is does look like we can improve
>> on performance further by preventing using the drive conventional zones as
>> "buffer" zones. If we let those be the final resting place of data, the SMR disk
>> only reclaim would not kick in and hurt performance as seen here. That I think
>> can all be done on top of this series though. Let's get this in first.
>>
> Thanks for the data! That indeed is very interesting; guess I'll do some 
> tests here on my setup, too.
> (And hope it doesn't burn my NVDIMM ...)
> 
> But, guess what, I had the some thoughts; we should be treating the 
> random zones more like sequential zones in a two-disk setup.
> So guess I'll be resurrecting the idea from my very first patch and 
> implement 'cache' zones in addition to the existing 'random' and 
> 'sequential' zones.

Yes, exactly. With that, reclaim can be modified simply to work only on cache
zones and not touch random zones. That will also work nicely for both single
and dual drive setups, with the difference that single drive will have no random
zones (all conventional zones will be cache zones).

With that, we could also play with intelligent zone allocation on the SMR drive
to try to put data that is most susceptible to change in random zones. Doing so,
we can do in-place updates after the first reclaim of a cache zone into a random
zone and reduce overall reclaim overhead.
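
A rough sketch of that idea (purely illustrative; dm-zoned has no 'cache' zone
type at this point, so all names below are assumptions):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical zone classes for a dual-drive setup (illustration only). */
enum dmz_zone_class {
	DMZ_ZONE_CACHE,		/* emulated zones on the regular/fast device */
	DMZ_ZONE_RANDOM,	/* conventional zones on the SMR drive */
	DMZ_ZONE_SEQ,		/* sequential write required zones */
};

struct zone_sketch {
	enum dmz_zone_class class;
	unsigned int weight;	/* e.g. number of valid blocks */
};

/* Reclaim would only pick cache zones, leaving random zones as a final
 * resting place for data that can then be updated in place. */
static bool zone_needs_reclaim(const struct zone_sketch *z)
{
	return z->class == DMZ_ZONE_CACHE && z->weight > 0;
}

int main(void)
{
	struct zone_sketch cache = { DMZ_ZONE_CACHE, 100 };
	struct zone_sketch rnd = { DMZ_ZONE_RANDOM, 100 };

	printf("cache zone needs reclaim : %d\n", zone_needs_reclaim(&cache));
	printf("random zone needs reclaim: %d\n", zone_needs_reclaim(&rnd));
	return 0;
}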

> But, as you said, that'll be a next series of patches.
> 
> What program did you use as a load generator?
> 
> Cheers,
> 
> Hannes
>
Damien Le Moal May 11, 2020, 10:55 a.m. UTC | #5
On 2020/05/11 11:46, Damien Le Moal wrote:
> Mike,
> 
> I am still seeing the warning:
> 
> [ 1827.839756] device-mapper: table: 253:1: adding target device sdj caused an
> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
> alignment_offset=0, start=0
> [ 1827.856738] device-mapper: table: 253:1: adding target device sdj caused an
> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
> alignment_offset=0, start=0
> [ 1827.874031] device-mapper: table: 253:1: adding target device sdj caused an
> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
> alignment_offset=0, start=0
> [ 1827.891086] device-mapper: table: 253:1: adding target device sdj caused an
> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
> alignment_offset=0, start=0
> 
> when mixing 512B sector and 4KB sector devices. Investigating now.


OK. Figured that one out: the 500GB SSD I am using for the regular device has a
capacity of 976773168 512B sectors, that is, not a multiple of the 256MB zone
size, and not even a multiple of 4K. This causes the creation of a 12MB runt
zone of 24624 sectors, which is ignored. But the start sector of the second
device in the dm-table remains 976773168, so not aligned on 4K. This causes
bdev_stack_limits to return an error and the above messages to be printed.

So I think we need to completely ignore the eventual "runt" zone of the regular
device so that everything aligns correctly. This will need changes in both
dmzadm and dm-zoned.

Hannes, I can hack something on top of your series. Or can you resend with that
fixed ?
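
To put numbers on that, a tiny sketch of the zone/runt computation for this
device (the capacity and zone size are taken from the description above; the
code itself is only an illustration):

#include <stdio.h>

int main(void)
{
	unsigned long long capacity = 976773168ULL;            /* 512B sectors */
	unsigned long long zone_sectors = (256ULL << 20) >> 9; /* 256 MiB = 524288 sectors */
	unsigned long long full_zones = capacity / zone_sectors;
	unsigned long long runt = capacity % zone_sectors;

	printf("full zones: %llu\n", full_zones);               /* 1863 */
	printf("runt zone : %llu sectors (~%llu MiB)\n",
	       runt, runt * 512 / (1ULL << 20));                /* 24624 sectors, ~12 MiB */
	printf("2nd device start, runt kept   : %llu\n", capacity);
	printf("2nd device start, rounded down: %llu\n",
	       full_zones * zone_sectors);
	return 0;
}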
Hannes Reinecke May 11, 2020, 11:19 a.m. UTC | #6
On 5/11/20 12:55 PM, Damien Le Moal wrote:
> On 2020/05/11 11:46, Damien Le Moal wrote:
>> Mike,
>>
>> I am still seeing the warning:
>>
>> [ 1827.839756] device-mapper: table: 253:1: adding target device sdj caused an
>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>> alignment_offset=0, start=0
>> [ 1827.856738] device-mapper: table: 253:1: adding target device sdj caused an
>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>> alignment_offset=0, start=0
>> [ 1827.874031] device-mapper: table: 253:1: adding target device sdj caused an
>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>> alignment_offset=0, start=0
>> [ 1827.891086] device-mapper: table: 253:1: adding target device sdj caused an
>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>> alignment_offset=0, start=0
>>
>> when mixing 512B sector and 4KB sector devices. Investigating now.
> 
> 
> OK. Figured that one out: the 500GB SSD I am using for the regular device is
> 976773168 512B sectors capacity, that is, not a multiple of the 256MB zone size,
> and not even a multiple of 4K. This causes the creation of a 12MB runt zone of
> 24624 sectors, which is ignored. But the start sector of the second device in
> the dm-table remains 976773168, so not aligned on 4K. This causes
> bdev_stack_limits to return an error and the above messages to be printed.
> 
> So I think we need to completely ignore the eventual "runt" zone of the regular
> device so that everything aligns correctly. This will need changes in both
> dmzadm and dm-zoned.
> 
> Hannes, I can hack something on top of your series. Or can you resend with that
> fixed ?
> 
> 
I _thought_ I had this fixed; the idea was to manipulate the 'runt' zone 
such that the zone would always be displayed as a zone with the same size as 
all the other zones, but marked as offline. I.e. the (logical) zone layout 
would always be equidistant, with no runt zones in between.
From that perspective the actual size of the runt zone wouldn't matter 
at all.

Lemme check.

Cheers,

Hannes
Hannes Reinecke May 11, 2020, 11:24 a.m. UTC | #7
On 5/11/20 12:55 PM, Damien Le Moal wrote:
> On 2020/05/11 11:46, Damien Le Moal wrote:
>> Mike,
>>
>> I am still seeing the warning:
>>
>> [ 1827.839756] device-mapper: table: 253:1: adding target device sdj caused an
>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>> alignment_offset=0, start=0
>> [ 1827.856738] device-mapper: table: 253:1: adding target device sdj caused an
>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>> alignment_offset=0, start=0
>> [ 1827.874031] device-mapper: table: 253:1: adding target device sdj caused an
>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>> alignment_offset=0, start=0
>> [ 1827.891086] device-mapper: table: 253:1: adding target device sdj caused an
>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>> alignment_offset=0, start=0
>>
>> when mixing 512B sector and 4KB sector devices. Investigating now.
> 
> 
> OK. Figured that one out: the 500GB SSD I am using for the regular device is
> 976773168 512B sectors capacity, that is, not a multiple of the 256MB zone size,
> and not even a multiple of 4K. This causes the creation of a 12MB runt zone of
> 24624 sectors, which is ignored. But the start sector of the second device in
> the dm-table remains 976773168, so not aligned on 4K. This causes
> bdev_stack_limits to return an error and the above messages to be printed.
> 
> So I think we need to completely ignore the eventual "runt" zone of the regular
> device so that everything aligns correctly. This will need changes in both
> dmzadm and dm-zoned.
> 
> Hannes, I can hack something on top of your series. Or can you resend with that
> fixed ?
> 
> 
> 
> 
Does this one help?

diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
index ea43f6892ced..5daca82b5ec7 100644
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -1041,13 +1041,17 @@ static int dmz_iterate_devices(struct dm_target *ti,
  {
         struct dmz_target *dmz = ti->private;
         unsigned int zone_nr_sectors = dmz_zone_nr_sectors(dmz->metadata);
+       unsigned int nr_zones;
         sector_t capacity;
         int r;

-       capacity = dmz->dev[0].capacity & ~(zone_nr_sectors - 1);
+       nr_zones = DIV_ROUND_DOWN(dmz->dev[0].capacity, zone_nr_sectors);
+       capacity = nr_zones * zone_nr_sectors;
         r = fn(ti, dmz->ddev[0], 0, capacity, data);
         if (!r && dmz->ddev[1]) {
-               capacity = dmz->dev[1].capacity & ~(zone_nr_sectors - 1);
+               nr_zones = DIV_ROUND_DOWN(dmz->dev[1].capacity,
+                                                  zone_nr_sectors);
+               capacity = nr_zones * zone_nr_sectors;
                 r = fn(ti, dmz->ddev[1], 0, capacity, data);
         }
         return r;

Cheers,

Hannes
Damien Le Moal May 11, 2020, 11:25 a.m. UTC | #8
On 2020/05/11 20:19, Hannes Reinecke wrote:
> On 5/11/20 12:55 PM, Damien Le Moal wrote:
>> On 2020/05/11 11:46, Damien Le Moal wrote:
>>> Mike,
>>>
>>> I am still seeing the warning:
>>>
>>> [ 1827.839756] device-mapper: table: 253:1: adding target device sdj caused an
>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>> alignment_offset=0, start=0
>>> [ 1827.856738] device-mapper: table: 253:1: adding target device sdj caused an
>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>> alignment_offset=0, start=0
>>> [ 1827.874031] device-mapper: table: 253:1: adding target device sdj caused an
>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>> alignment_offset=0, start=0
>>> [ 1827.891086] device-mapper: table: 253:1: adding target device sdj caused an
>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>> alignment_offset=0, start=0
>>>
>>> when mixing 512B sector and 4KB sector devices. Investigating now.
>>
>>
>> OK. Figured that one out: the 500GB SSD I am using for the regular device is
>> 976773168 512B sectors capacity, that is, not a multiple of the 256MB zone size,
>> and not even a multiple of 4K. This causes the creation of a 12MB runt zone of
>> 24624 sectors, which is ignored. But the start sector of the second device in
>> the dm-table remains 976773168, so not aligned on 4K. This causes
>> bdev_stack_limits to return an error and the above messages to be printed.
>>
>> So I think we need to completely ignore the eventual "runt" zone of the regular
>> device so that everything aligns correctly. This will need changes in both
>> dmzadm and dm-zoned.
>>
>> Hannes, I can hack something on top of your series. Or can you resend with that
>> fixed ?
>>
>>
> I _thought_ I had this fixed; the idea was to manipulate the 'runt' zone 
> such that the zone would always displayed as a zone with same size as 
> all the other zones, but marked as offline. IE the (logical) zone layout 
> would always be equidistant, with no runt zones in between.
>  From that perspective the actual size of the runt zone wouldn't matter 
> at all.
> 
> Lemme check.

Was just playing with dmzadm right now, and I did notice that the second device
start offset is indeed a round number of zones, larger than the actual regular
device capacity in my test case. So indeed, that code is in place there.

So the problem may be on the kernel side, something using the first dev capacity
as is instead of the value rounded up to the zone size... Digging too.

> 
> Cheers,
> 
> Hannes
>
Damien Le Moal May 11, 2020, 11:46 a.m. UTC | #9
On 2020/05/11 20:25, Hannes Reinecke wrote:
> On 5/11/20 12:55 PM, Damien Le Moal wrote:
>> On 2020/05/11 11:46, Damien Le Moal wrote:
>>> Mike,
>>>
>>> I am still seeing the warning:
>>>
>>> [ 1827.839756] device-mapper: table: 253:1: adding target device sdj caused an
>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>> alignment_offset=0, start=0
>>> [ 1827.856738] device-mapper: table: 253:1: adding target device sdj caused an
>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>> alignment_offset=0, start=0
>>> [ 1827.874031] device-mapper: table: 253:1: adding target device sdj caused an
>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>> alignment_offset=0, start=0
>>> [ 1827.891086] device-mapper: table: 253:1: adding target device sdj caused an
>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>> alignment_offset=0, start=0
>>>
>>> when mixing 512B sector and 4KB sector devices. Investigating now.
>>
>>
>> OK. Figured that one out: the 500GB SSD I am using for the regular device is
>> 976773168 512B sectors capacity, that is, not a multiple of the 256MB zone size,
>> and not even a multiple of 4K. This causes the creation of a 12MB runt zone of
>> 24624 sectors, which is ignored. But the start sector of the second device in
>> the dm-table remains 976773168, so not aligned on 4K. This causes
>> bdev_stack_limits to return an error and the above messages to be printed.
>>
>> So I think we need to completely ignore the eventual "runt" zone of the regular
>> device so that everything aligns correctly. This will need changes in both
>> dmzadm and dm-zoned.
>>
>> Hannes, I can hack something on top of your series. Or can you resend with that
>> fixed ?
>>
>>
>>
>>
> Does this one help?

Nope. Same warning.

> 
> diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
> index ea43f6892ced..5daca82b5ec7 100644
> --- a/drivers/md/dm-zoned-target.c
> +++ b/drivers/md/dm-zoned-target.c
> @@ -1041,13 +1041,17 @@ static int dmz_iterate_devices(struct dm_target *ti,
>   {
>          struct dmz_target *dmz = ti->private;
>          unsigned int zone_nr_sectors = dmz_zone_nr_sectors(dmz->metadata);
> +       unsigned int nr_zones;
>          sector_t capacity;
>          int r;
> 
> -       capacity = dmz->dev[0].capacity & ~(zone_nr_sectors - 1);
> +       nr_zones = DIV_ROUND_DOWN(dmz->dev[0].capacity, zone_nr_sectors);
> +       capacity = nr_zones * zone_nr_sectors;

	capacity = round_down(dmz->dev[0].capacity, zone_nr_sectors);

is simpler :)

In any case, your change does not seem to do anything here. Before and after, the
capacity is rounded down to full zones, excluding the last runt zone. I think it
has to do with the table entry start offset given on DM start by dmzadm...


>          r = fn(ti, dmz->ddev[0], 0, capacity, data);
>          if (!r && dmz->ddev[1]) {
> -               capacity = dmz->dev[1].capacity & ~(zone_nr_sectors - 1);
> +               nr_zones = DIV_ROUND_DOWN(dmz->dev[1.capacity,
> +                                                  zone_nr_sectors));
> +               capacity = nr_zones * zone_nr_sectors;
>                  r = fn(ti, dmz->ddev[1], 0, capacity, data);
>          }
>          return r;
> 
> Cheers,
> 
> Hannes
>
Damien Le Moal May 11, 2020, 1:23 p.m. UTC | #10
On 2020/05/11 20:46, Damien Le Moal wrote:
> On 2020/05/11 20:25, Hannes Reinecke wrote:
>> On 5/11/20 12:55 PM, Damien Le Moal wrote:
>>> On 2020/05/11 11:46, Damien Le Moal wrote:
>>>> Mike,
>>>>
>>>> I am still seeing the warning:
>>>>
>>>> [ 1827.839756] device-mapper: table: 253:1: adding target device sdj caused an
>>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>>> alignment_offset=0, start=0
>>>> [ 1827.856738] device-mapper: table: 253:1: adding target device sdj caused an
>>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>>> alignment_offset=0, start=0
>>>> [ 1827.874031] device-mapper: table: 253:1: adding target device sdj caused an
>>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>>> alignment_offset=0, start=0
>>>> [ 1827.891086] device-mapper: table: 253:1: adding target device sdj caused an
>>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>>> alignment_offset=0, start=0
>>>>
>>>> when mixing 512B sector and 4KB sector devices. Investigating now.
>>>
>>>
>>> OK. Figured that one out: the 500GB SSD I am using for the regular device is
>>> 976773168 512B sectors capacity, that is, not a multiple of the 256MB zone size,
>>> and not even a multiple of 4K. This causes the creation of a 12MB runt zone of
>>> 24624 sectors, which is ignored. But the start sector of the second device in
>>> the dm-table remains 976773168, so not aligned on 4K. This causes
>>> bdev_stack_limits to return an error and the above messages to be printed.
>>>
>>> So I think we need to completely ignore the eventual "runt" zone of the regular
>>> device so that everything aligns correctly. This will need changes in both
>>> dmzadm and dm-zoned.
>>>
>>> Hannes, I can hack something on top of your series. Or can you resend with that
>>> fixed ?
>>>
>>>
>>>
>>>
>> Does this one help?
> 
> Nope. Same warning.
> 
>>
>> diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
>> index ea43f6892ced..5daca82b5ec7 100644
>> --- a/drivers/md/dm-zoned-target.c
>> +++ b/drivers/md/dm-zoned-target.c
>> @@ -1041,13 +1041,17 @@ static int dmz_iterate_devices(struct dm_target *ti,
>>   {
>>          struct dmz_target *dmz = ti->private;
>>          unsigned int zone_nr_sectors = dmz_zone_nr_sectors(dmz->metadata);
>> +       unsigned int nr_zones;
>>          sector_t capacity;
>>          int r;
>>
>> -       capacity = dmz->dev[0].capacity & ~(zone_nr_sectors - 1);
>> +       nr_zones = DIV_ROUND_DOWN(dmz->dev[0].capacity, zone_nr_sectors);
>> +       capacity = nr_zones * zone_nr_sectors;
> 
> 	capacity = round_down(dmz->dev[0].capacity, zone_nr_sectors);
> 
> is simpler :)
> 
> In any case, your change does seem to do anything here. Before and after, the
> capacity is rounded down to full zones, excluding the last runt zone. I think it
> is to do with the table entry start offset given on DM start by dmzadm...
> 
> 
>>          r = fn(ti, dmz->ddev[0], 0, capacity, data);
>>          if (!r && dmz->ddev[1]) {
>> -               capacity = dmz->dev[1].capacity & ~(zone_nr_sectors - 1);
>> +               nr_zones = DIV_ROUND_DOWN(dmz->dev[1.capacity,
>> +                                                  zone_nr_sectors));
>> +               capacity = nr_zones * zone_nr_sectors;
>>                  r = fn(ti, dmz->ddev[1], 0, capacity, data);
>>          }
>>          return r;
>>
>> Cheers,

The failure of bdev_stack_limits() generating the warning is due to the io_opt
limit not being compatible with the physical block size... Nothing to do with
zone start/runt zones.

The problem is here:

	/* Optimal I/O a multiple of the physical block size? */
        if (t->io_opt & (t->physical_block_size - 1)) {
                t->io_opt = 0;
                t->misaligned = 1;
                ret = -1;
        }

For the ssd (t), I have io_opt at 512 and the physical block size at 4096,
changed due to the stacking from the device's real 512B phys block to the smr
disk phys block. The SMR disk io_opt is 0, so the ssd io_opt remains unchanged
at 512. And we end up with the misaligned trigger since 512 & 4095 = 512...

I do not understand clearly yet... I wonder why the io_opt for the SMR drive is
not 4096, same as the physical sector size.

Late today. Will keep digging tomorrow.

Cheers.


>>
>> Hannes
>>
> 
>
Mike Snitzer May 12, 2020, 4:49 p.m. UTC | #11
On Mon, May 11 2020 at  2:31am -0400,
Hannes Reinecke <hare@suse.de> wrote:

> On 5/11/20 4:46 AM, Damien Le Moal wrote:
> >On 2020/05/08 18:03, Hannes Reinecke wrote:
> >>Hi all,
> >>
> >>this patchset adds a new metadata version 2 for dm-zoned, which brings the
> >>following improvements:
> >>
> >>- UUIDs and labels: Adding three more fields to the metadata containing
> >>   the dm-zoned device UUID and label, and the device UUID. This allows
> >>   for an unique identification of the devices, so that several dm-zoned
> >>   sets can coexist and have a persistent identification.
> >>- Extend random zones by an additional regular disk device: A regular
> >>   block device can be added together with the zoned block device, providing
> >>   additional (emulated) random write zones. With this it's possible to
> >>   handle sequential zones only devices; also there will be a speed-up if
> >>   the regular block device resides on a fast medium. The regular block device
> >>   is placed logically in front of the zoned block device, so that metadata
> >>   and mapping tables reside on the regular block device, not the zoned device.
> >>- Tertiary superblock support: In addition to the two existing sets of metadata
> >>   another, tertiary, superblock is written to the first block of the zoned
> >>   block device. This superblock is for identification only; the generation
> >>   number is set to '0' and the block itself it never updated. The addition
> >>   metadate like bitmap tables etc are not copied.
> >>
> >>To handle this, some changes to the original handling are introduced:
> >>- Zones are now equidistant. Originally, runt zones were ignored, and
> >>   not counted when sizing the mapping tables. With the dual device setup
> >>   runt zones might occur at the end of the regular block device, making
> >>   direct translation between zone number and sector/block number complex.
> >>   For metadata version 2 all zones are considered to be of the same size,
> >>   and runt zones are simply marked as 'offline' to have them ignored when
> >>   allocating a new zone.
> >>- The block number in the superblock is now the global number, and refers to
> >>   the location of the superblock relative to the resulting device-mapper
> >>   device. Which means that the tertiary superblock contains absolute block
> >>   addresses, which needs to be translated to the relative device addresses
> >>   to find the referenced block.
> >>
> >>There is an accompanying patchset for dm-zoned-tools for writing and checking
> >>this new metadata.
> >>
> >>As usual, comments and reviews are welcome.
> >
> >I gave this series a good round of testing. See the attached picture for the
> >results. The test is this:
> >1) Setup dm-zoned
> >2) Format and mount with mkfs.ext4 -E packed_meta_blocks=1 /dev/mapper/xxx
> >3) Create file random in size between 1 and 4MB and measure user seen throughput
> >over 100 files.
> >3) Run that for 2 hours
> >
> >I ran this over a 15TB SMR drive single drive setup, and on the same drive + a
> >500GB m.2 ssd added.
> >
> >For the single drive case, the usual 3 phases can be seen: start writing at
> >about 110MB/s, everything going to conventional zones (note conv zones are in
> >the middle of the disk, hence the low-ish throughput). Then after about 400s,
> >reclaim kicks in and the throughput drops to 60-70 MB/s. As reclaim cannot keep
> >up under this heavy write workload, performance drops to 20-30MB/s after 800s.
> >All good, without any idle time for reclaim to do its job, this is all expected.
> >
> >For the dual drive case, things are more interesting:
> >1) The first phase is longer as overall, there is more conventional space (500G
> >ssd + 400G on SMR drive). So we see the SSD speed first (~425MB/s), then the
> >drive speed (100 MB/s), slightly lower than the single drive case toward the end
> >as reclaim triggers.
> >2) Some recovery back to ssd speed, then a long phase at half the speed of the
> >ssd as writes go to ssd and reclaim is running moving data out of the ssd onto
> >the disk.
> >3) Then a long phase at 25MB/s due to SMR disk reclaim.
> >4) back up to half the ssd speed.
> >
> >No crashes, no data corruption, all good. But is does look like we can improve
> >on performance further by preventing using the drive conventional zones as
> >"buffer" zones. If we let those be the final resting place of data, the SMR disk
> >only reclaim would not kick in and hurt performance as seen here. That I think
> >can all be done on top of this series though. Let's get this in first.
> >
> Thanks for the data! That indeed is very interesting; guess I'll do
> some tests here on my setup, too.
> (And hope it doesn't burn my NVDIMM ...)
> 
> But, guess what, I had the some thoughts; we should be treating the
> random zones more like sequential zones in a two-disk setup.
> So guess I'll be resurrecting the idea from my very first patch and
> implement 'cache' zones in addition to the existing 'random' and
> 'sequential' zones.
> But, as you said, that'll be a next series of patches.

FYI, I staged the series in linux-next (for 5.8) yesterday, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.8

So please base any follow-on fixes or advances on this baseline.

Thanks!
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Damien Le Moal May 13, 2020, 11:56 p.m. UTC | #12
Adding Martin.

On 2020/05/11 22:23, Damien Le Moal wrote:
> On 2020/05/11 20:46, Damien Le Moal wrote:
>> On 2020/05/11 20:25, Hannes Reinecke wrote:
>>> On 5/11/20 12:55 PM, Damien Le Moal wrote:
>>>> On 2020/05/11 11:46, Damien Le Moal wrote:
>>>>> Mike,
>>>>>
>>>>> I am still seeing the warning:
>>>>>
>>>>> [ 1827.839756] device-mapper: table: 253:1: adding target device sdj caused an
>>>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>>>> alignment_offset=0, start=0
>>>>> [ 1827.856738] device-mapper: table: 253:1: adding target device sdj caused an
>>>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>>>> alignment_offset=0, start=0
>>>>> [ 1827.874031] device-mapper: table: 253:1: adding target device sdj caused an
>>>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>>>> alignment_offset=0, start=0
>>>>> [ 1827.891086] device-mapper: table: 253:1: adding target device sdj caused an
>>>>> alignment inconsistency: physical_block_size=4096, logical_block_size=4096,
>>>>> alignment_offset=0, start=0
>>>>>
>>>>> when mixing 512B sector and 4KB sector devices. Investigating now.
>>>>
>>>>
>>>> OK. Figured that one out: the 500GB SSD I am using for the regular device is
>>>> 976773168 512B sectors capacity, that is, not a multiple of the 256MB zone size,
>>>> and not even a multiple of 4K. This causes the creation of a 12MB runt zone of
>>>> 24624 sectors, which is ignored. But the start sector of the second device in
>>>> the dm-table remains 976773168, so not aligned on 4K. This causes
>>>> bdev_stack_limits to return an error and the above messages to be printed.
>>>>
>>>> So I think we need to completely ignore the eventual "runt" zone of the regular
>>>> device so that everything aligns correctly. This will need changes in both
>>>> dmzadm and dm-zoned.
>>>>
>>>> Hannes, I can hack something on top of your series. Or can you resend with that
>>>> fixed ?
>>>>
>>>>
>>>>
>>>>
>>> Does this one help?
>>
>> Nope. Same warning.
>>
>>>
>>> diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
>>> index ea43f6892ced..5daca82b5ec7 100644
>>> --- a/drivers/md/dm-zoned-target.c
>>> +++ b/drivers/md/dm-zoned-target.c
>>> @@ -1041,13 +1041,17 @@ static int dmz_iterate_devices(struct dm_target *ti,
>>>   {
>>>          struct dmz_target *dmz = ti->private;
>>>          unsigned int zone_nr_sectors = dmz_zone_nr_sectors(dmz->metadata);
>>> +       unsigned int nr_zones;
>>>          sector_t capacity;
>>>          int r;
>>>
>>> -       capacity = dmz->dev[0].capacity & ~(zone_nr_sectors - 1);
>>> +       nr_zones = DIV_ROUND_DOWN(dmz->dev[0].capacity, zone_nr_sectors);
>>> +       capacity = nr_zones * zone_nr_sectors;
>>
>> 	capacity = round_down(dmz->dev[0].capacity, zone_nr_sectors);
>>
>> is simpler :)
>>
>> In any case, your change does seem to do anything here. Before and after, the
>> capacity is rounded down to full zones, excluding the last runt zone. I think it
>> is to do with the table entry start offset given on DM start by dmzadm...
>>
>>
>>>          r = fn(ti, dmz->ddev[0], 0, capacity, data);
>>>          if (!r && dmz->ddev[1]) {
>>> -               capacity = dmz->dev[1].capacity & ~(zone_nr_sectors - 1);
>>> +               nr_zones = DIV_ROUND_DOWN(dmz->dev[1.capacity,
>>> +                                                  zone_nr_sectors));
>>> +               capacity = nr_zones * zone_nr_sectors;
>>>                  r = fn(ti, dmz->ddev[1], 0, capacity, data);
>>>          }
>>>          return r;
>>>
>>> Cheers,
> 
> The failure of bdev_stack_limits() generating the warning is due to the io_opt
> limit not being compatible with the physical blocks size... Nothing to do with
> zone start/runt zones.
> 
> The problem is here:
> 
> 	/* Optimal I/O a multiple of the physical block size? */
>         if (t->io_opt & (t->physical_block_size - 1)) {
>                 t->io_opt = 0;
>                 t->misaligned = 1;
>                 ret = -1;
>         }
> 
> For the ssd (t), I have io_opt at 512 and the physical block size at 4096,
> changed due to the satcking from the device real 512 phys block to the smr disk
> phys block. The SMR disk io_opt is 0, so the ssd io_opt remains unchaged at 512.
> And we end up with the misaligned trigger since 512 & 4095 = 512...
> 
> I do not understand clearly yet... I wonder why the io_opt for the SMR drive is
> not 4096, same as the physical sector size.

I investigated this and here is what I found out:

When the dual drive setup is started, dm_calculate_queue_limits() is called and
ti->type->iterate_devices(ti, dm_set_device_limits, &ti_limits) is executed.

In dm-zoned, the iterate device method executes dm_set_device_limits() twice,
once for each drive of the setup.

The drives I am using are an M.2 SSD with a phys sector size of 512B and an
optimal IO size set to 512. The SMR drive has a phys sector size of 4K and the
optimal IO size set to 0; the drive does not report any value, which is not
uncommon for HDDs.

The result of bdev_stack_limits() execution from dm_set_device_limits() gives a
DM device phys sector of 4K, no surprise. The io_opt limit though ends up being
512 = lcm(0, 512). That results in this condition triggering:

	/* Optimal I/O a multiple of the physical block size? */
        if (t->io_opt & (t->physical_block_size - 1)) {
                t->io_opt = 0;
                t->misaligned = 1;
                ret = -1;
        }

since 512 & (4096 - 1) is not 0...
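
A standalone sketch of that stacking behavior (mirroring the lcm-based io_opt
stacking and the misalignment check quoted above; simplified, not the actual
block layer code):

#include <stdio.h>

static unsigned int gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* lcm treating 0 as "not set", like the block layer's lcm_not_zero() */
static unsigned int lcm_not_zero(unsigned int a, unsigned int b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a / gcd(a, b) * b;
}

int main(void)
{
	unsigned int ssd_io_opt = 512;   /* M.2 SSD reports io_opt = 512 */
	unsigned int smr_io_opt = 0;     /* SMR drive reports no optimal size */
	unsigned int phys_bs = 4096;     /* stacked physical block size */
	unsigned int io_opt = lcm_not_zero(ssd_io_opt, smr_io_opt);

	printf("stacked io_opt = %u\n", io_opt);    /* 512 */
	if (io_opt & (phys_bs - 1))
		printf("misaligned: io_opt %u is not a multiple of pbs %u\n",
		       io_opt, phys_bs);
	return 0;
}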

It looks to me like we should either have io_opt always be at least the phys
sector size, or change the limit stacking io_opt handling. I am not sure which
approach is best... Thoughts ?

Martin,

Any idea why the io_opt limit is not set to the physical block size when the
drive does not report an optimal transfer length ? Would it be bad to set that
value instead of leaving it to 0 ?
Martin K. Petersen May 14, 2020, 12:20 a.m. UTC | #13
Damien,

> Any idea why the io_opt limit is not set to the physical block size
> when the drive does not report an optimal transfer length ? Would it
> be bad to set that value instead of leaving it to 0 ?

The original intent was that io_opt was a weak heuristic for something
being a RAID device. Regular disk drives didn't report it. These days
that distinction probably isn't relevant.

However, before we entertain departing from the historic io_opt
behavior, I am a bit puzzled by the fact that you have a device that
reports io_opt as 512 bytes. What kind of device performs best when each
I/O is limited to a single logical block?
Damien Le Moal May 14, 2020, 12:55 a.m. UTC | #14
On 2020/05/14 9:22, Martin K. Petersen wrote:
> 
> Damien,
> 
>> Any idea why the io_opt limit is not set to the physical block size
>> when the drive does not report an optimal transfer length ? Would it
>> be bad to set that value instead of leaving it to 0 ?
> 
> The original intent was that io_opt was a weak heuristic for something
> being a RAID device. Regular disk drives didn't report it. These days
> that distinction probably isn't relevant.
> 
> However, before we entertain departing from the historic io_opt
> behavior, I am a bit puzzled by the fact that you have a device that
> reports io_opt as 512 bytes. What kind of device performs best when each
> I/O is limited to a single logical block?
> 

Indeed. It is an NVMe M.2 consumer grade SSD. Nothing fancy. If you look at
nvme/host/core.c nvme_update_disk_info(), you will see that io_opt is set to the
block size... This is probably abusing this limit. So I guess the most elegant
fix may be to have nvme stop doing that ?
Martin K. Petersen May 14, 2020, 2:19 a.m. UTC | #15
Damien,

> Indeed. It is an NVMe M.2 consumer grade SSD. Nothing fancy. If you
> look at nvme/host/core.c nvme_update_disk_info(), you will see that
> io_opt is set to the block size... This is probably abusing this
> limit. So I guess the most elegant fix may be to have nvme stop doing
> that ?

Yeah, I'd prefer for io_opt to only be set if the device actually
reports NOWS.

The purpose of io_min is to be the preferred lower I/O size
boundary. One should not submit I/Os smaller than this.

And io_opt is the preferred upper boundary for I/Os. One should not
issue I/Os larger than this value. Setting io_opt to the logical block
size kind of defeats that intent.

That said, we should probably handle the case where the pbs gets scaled
up but io_opt doesn't more gracefully.
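
A hedged sketch of what is being proposed here (illustrative only, not the
actual nvme_update_disk_info() code): derive io_opt from NOWS only when the
namespace reports one, and otherwise leave it at 0 so the stacking code treats
it as unset.

#include <stdint.h>
#include <stdio.h>

/* NOWS (Namespace Optimal Write Size) is a 0's-based count of logical blocks. */
static unsigned int io_opt_from_nows(unsigned int block_size, uint16_t nows)
{
	if (!nows)
		return 0;               /* leave io_opt unset */
	return block_size * (1 + nows);
}

int main(void)
{
	/* Consumer SSD not reporting NOWS: io_opt stays 0 */
	printf("no NOWS : io_opt = %u\n", io_opt_from_nows(512, 0));
	/* Device reporting NOWS = 7 (8 blocks) on 4K LBAs: io_opt = 32768 */
	printf("NOWS = 7: io_opt = %u\n", io_opt_from_nows(4096, 7));
	return 0;
}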
Damien Le Moal May 14, 2020, 2:22 a.m. UTC | #16
On 2020/05/14 11:20, Martin K. Petersen wrote:
> 
> Damien,
> 
>> Indeed. It is an NVMe M.2 consumer grade SSD. Nothing fancy. If you
>> look at nvme/host/core.c nvme_update_disk_info(), you will see that
>> io_opt is set to the block size... This is probably abusing this
>> limit. So I guess the most elegant fix may be to have nvme stop doing
>> that ?
> 
> Yeah, I'd prefer for io_opt to only be set if the device actually
> reports NOWS.

Sent a patch :)

> 
> The purpose of io_min is to be the preferred lower I/O size
> boundary. One should not submit I/Os smaller than this.
> 
> And io_opt is the preferred upper boundary for I/Os. One should not
> issue I/Os larger than this value. Setting io_opt to the logical block
> size kind of defeats that intent.
> 
> That said, we should probably handle the case where the pbs gets scaled
> up but io_opt doesn't more gracefully.

Yes. Will look at that too.

Thanks !

>
Hannes Reinecke May 14, 2020, 5:56 a.m. UTC | #17
On 5/14/20 2:55 AM, Damien Le Moal wrote:
> On 2020/05/14 9:22, Martin K. Petersen wrote:
>>
>> Damien,
>>
>>> Any idea why the io_opt limit is not set to the physical block size
>>> when the drive does not report an optimal transfer length ? Would it
>>> be bad to set that value instead of leaving it to 0 ?
>>
>> The original intent was that io_opt was a weak heuristic for something
>> being a RAID device. Regular disk drives didn't report it. These days
>> that distinction probably isn't relevant.
>>
>> However, before we entertain departing from the historic io_opt
>> behavior, I am a bit puzzled by the fact that you have a device that
>> reports io_opt as 512 bytes. What kind of device performs best when each
>> I/O is limited to a single logical block?
>>
> 
> Indeed. It is an NVMe M.2 consumer grade SSD. Nothing fancy. If you look at
> nvme/host/core.c nvme_update_disk_info(), you will see that io_opt is set to the
> block size... This is probably abusing this limit. So I guess the most elegant
> fix may be to have nvme stop doing that ?
> 
> 
Yes, I guess that would be the best approach. If the driver doesn't 
report it, we shouldn't make up any values but rather leave it at '0'.

Cheers,

Hannes