
[0/2] New zonefs file system

Message ID 20191212183816.102402-1-damien.lemoal@wdc.com (mailing list archive)

Message

Damien Le Moal Dec. 12, 2019, 6:38 p.m. UTC
zonefs is a very simple file system exposing each zone of a zoned block
device as a file. Unlike a regular file system with zoned block device
support (e.g. f2fs or the ongoing btrfs effort), zonefs does not hide
the sequential write constraint of zoned block devices from the user.
Files representing sequential write zones of the device must be written
sequentially, starting from the end of the file (append-only writes).
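
For example, appending to a sequential zone file can look like the
sketch below (illustrative only, not code from this series: the path,
block size and error handling are made up; sequential zone files accept
only direct I/O writes placed exactly at the end of the file):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/stat.h>

    /*
     * Append len bytes to a zonefs sequential zone file. len must be a
     * multiple of the device logical block size (4096 assumed here),
     * as sequential zone files only accept direct I/O writes.
     */
    int zone_file_append(const char *path, const void *data, size_t len)
    {
        struct stat st;
        void *buf;
        int fd, ret = -1;

        fd = open(path, O_WRONLY | O_DIRECT);
        if (fd < 0)
            return -1;

        /* The file size is the zone write pointer position: writes
         * are only accepted at this offset (append only). */
        if (fstat(fd, &st) == 0 && posix_memalign(&buf, 4096, len) == 0) {
            memcpy(buf, data, len);
            if (pwrite(fd, buf, len, st.st_size) == (ssize_t)len)
                ret = 0;
            free(buf);
        }
        close(fd);
        return ret;
    }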

zonefs is not a POSIX compliant file system. Its goal is to simplify
the implementation of zoned block device support in applications by
replacing raw block device file accesses with a richer file-based API,
avoiding reliance on direct block device file ioctls, which may
be more obscure to developers. One example of this approach is the
implementation of LSM (log-structured merge) tree structures (such as
those used in RocksDB and LevelDB) on zoned block devices by allowing
SSTables to be stored in a zone file similarly to a regular file system
rather than as a range of sectors of a zoned device. The introduction
of the higher level construct "one file is one zone" can help reduce
the amount of changes needed in the application while at the same time
allowing the use of zoned block devices with various programming
languages other than C.

The zonefs IO management implementation uses the new generic iomap code.
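
As a rough, simplified sketch of what this looks like (hedged: the
ZONEFS_I() and i_zsector helpers are assumed from the patch, and the
real code in fs/zonefs/super.c handles many more cases), an iomap_begin
operation for a zone file can map a file offset linearly onto the
device, since a zone file's data is contiguous starting at its zone
start sector:

    static int zonefs_iomap_begin(struct inode *inode, loff_t offset,
                                  loff_t length, unsigned int flags,
                                  struct iomap *iomap, struct iomap *srcmap)
    {
        struct zonefs_inode_info *zi = ZONEFS_I(inode);

        /* A zone file is a single contiguous extent starting at the
         * zone start sector, so mapping is a simple offset addition. */
        iomap->type = IOMAP_MAPPED;
        iomap->offset = offset;
        iomap->length = length;
        iomap->addr = (zi->i_zsector << SECTOR_SHIFT) + offset;
        iomap->bdev = inode->i_sb->s_bdev;

        return 0;
    }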

Damien Le Moal (2):
  fs: New zonefs file system
  zonefs: Add documentation

 Documentation/filesystems/zonefs.txt |  150 ++++
 MAINTAINERS                          |   10 +
 fs/Kconfig                           |    1 +
 fs/Makefile                          |    1 +
 fs/zonefs/Kconfig                    |    9 +
 fs/zonefs/Makefile                   |    4 +
 fs/zonefs/super.c                    | 1158 ++++++++++++++++++++++++++
 fs/zonefs/zonefs.h                   |  169 ++++
 include/uapi/linux/magic.h           |    1 +
 9 files changed, 1503 insertions(+)
 create mode 100644 Documentation/filesystems/zonefs.txt
 create mode 100644 fs/zonefs/Kconfig
 create mode 100644 fs/zonefs/Makefile
 create mode 100644 fs/zonefs/super.c
 create mode 100644 fs/zonefs/zonefs.h

Comments

Enrico Weigelt, metux IT consult Dec. 16, 2019, 8:18 a.m. UTC | #1
On 12.12.19 19:38, Damien Le Moal wrote:

Hi,

> zonefs is a very simple file system exposing each zone of a zoned block
> device as a file. Unlike a regular file system with zoned block device
> support (e.g. f2fs or the ongoing btrfs effort), zonefs does not hide
> the sequential write constraint of zoned block devices from the user.

Just curious: what's the exact definition of "zoned" here ?
Something like partitions ?

Can these files then also serve as block devices for other filesystems ?
Just a funny idea: could we handle partitions by a file system ?

Even more funny idea: give file systems block device ops, so they can
be directly used as such (w/o explicitly using loopdev) ;-)

> Files representing sequential write zones of the device must be written
> sequentially, starting from the end of the file (append-only writes).

So, these files can only be accessed like a tape ?

Assuming you're working on top of standard block devices anyways (instead
of tape-like media ;-)) - why introduce such a limitation ?

> zonefs is not a POSIX compliant file system. Its goal is to simplify
> the implementation of zoned block device support in applications by
> replacing raw block device file accesses with a richer file-based API,
> avoiding reliance on direct block device file ioctls, which may
> be more obscure to developers.

ioctls ?

Last time I checked, block devices could be easily accessed via plain
file ops (read, write, seek, ...). You can basically treat them just
like big files of fixed size.

> One example of this approach is the
> implementation of LSM (log-structured merge) tree structures (such as
> those used in RocksDB and LevelDB)

The same LevelDB as used e.g. in the Chrome browser, which destroys
itself every time a little temporary problem (e.g. disk full) occurs ?
If that's the use case I'd rather use a simple in-memory table and
enough swap, as leveldb isn't reliable enough for persistent data
anyways :p

> on zoned block devices by allowing SSTables
> to be stored in a zone file similarly to a regular file system rather
> than as a range of sectors of a zoned device. The introduction of the
> higher level construct "one file is one zone" can help reduce the
> amount of changes needed in the application while at the same time
> allowing the use of zoned block devices with various programming
> languages other than C.

Why not just simply use files on a suitable filesystem (w/ low block io
overhead) or LVM volumes ?


--mtx
Carlos Maiolino Dec. 16, 2019, 9:35 a.m. UTC | #2
On Mon, Dec 16, 2019 at 09:18:23AM +0100, Enrico Weigelt, metux IT consult wrote:
> On 12.12.19 19:38, Damien Le Moal wrote:
> 
> Hi,
> 
> > zonefs is a very simple file system exposing each zone of a zoned block
> > device as a file. Unlike a regular file system with zoned block device
> > support (e.g. f2fs or the ongoing btrfs effort), zonefs does not hide
> > the sequential write constraint of zoned block devices from the user.
> 
> Just curious: what's the exact definition of "zoned" here ?
> Something like partitions ?

Zones inside an SMR HDD.

> 
> Can these files then also serve as block devices for other filesystems ?
> Just a funny idea: could we handle partitions by a file system ?
> 
> Even more funny idea: give file systems block device ops, so they can
> be directly used as such (w/o explicitly using loopdev) ;-)
> 
> > Files representing sequential write zones of the device must be written
> > sequentially, starting from the end of the file (append-only writes).
> 
> So, these files can only be accessed like a tape ?

On an SMR HDD, each zone can only be written sequentially, due to
physical constraints. I won't post any links with references because I
think majordomo will spam my email if I do, but do a Google search for
something like 'SMR HDD zones' and you'll get a better idea.


> 
> Assuming you're working on top of standard block devices anyways (instead
> of tape-like media ;-)) - why introduce such a limitation ?

The limitation is already there on SMR drives; some of them
(drive-managed models) just hide it from the system.

> 
> > zonefs is not a POSIX compliant file system. Its goal is to simplify
> > the implementation of zoned block device support in applications by
> > replacing raw block device file accesses with a richer file-based API,
> > avoiding reliance on direct block device file ioctls, which may
> > be more obscure to developers.
> 
> ioctls ?
> 
> Last time I checked, block devices could be easily accessed via plain
> file ops (read, write, seek, ...). You can basically treat them just
> like big files of fixed size.
> 
> > One example of this approach is the
> > implementation of LSM (log-structured merge) tree structures (such as
> > those used in RocksDB and LevelDB)
> 
> The same LevelDB as used e.g. in the Chrome browser, which destroys
> itself every time a little temporary problem (e.g. disk full) occurs ?
> If that's the use case I'd rather use a simple in-memory table and
> enough swap, as leveldb isn't reliable enough for persistent data
> anyways :p
> 
> > on zoned block devices by allowing SSTables
> > to be stored in a zone file similarly to a regular file system rather
> > than as a range of sectors of a zoned device. The introduction of the
> > higher level construct "one file is one zone" can help reduce the
> > amount of changes needed in the application while at the same time
> > allowing the use of zoned block devices with various programming
> > languages other than C.
> 
> Why not just simply use files on a suitable filesystem (w/ low block io
> overhead) or LVM volumes ?
> 
> 
> --mtx
> 
> -- 
> Urgent notice: due to the existential threat posed by "Emotet", you
> should *never* accept/open MS Office documents via e-mail, even if
> they appear to come from supposedly trustworthy senders. Otherwise,
> total damage may ensue.
> ---
> Note: unencrypted e-mails can easily be intercepted and manipulated!
> For confidential communication, please send your GPG/PGP key.
> ---
> Enrico Weigelt, metux IT consult
> Free software and Linux embedded engineering
> info@metux.net -- +49-151-27565287
>
Carlos Maiolino Dec. 16, 2019, 9:42 a.m. UTC | #3
On Mon, Dec 16, 2019 at 10:36:00AM +0100, Carlos Maiolino wrote:
> On Mon, Dec 16, 2019 at 09:18:23AM +0100, Enrico Weigelt, metux IT consult wrote:
> > On 12.12.19 19:38, Damien Le Moal wrote:
> > 
> > Hi,
> > 
> > > zonefs is a very simple file system exposing each zone of a zoned block
> > > device as a file. Unlike a regular file system with zoned block device
> > > support (e.g. f2fs or the ongoing btrfs effort), zonefs does not hide
> > > the sequential write constraint of zoned block devices from the user.
> > 
> > Just curious: what's the exact definition of "zoned" here ?
> > Something like partitions ?
> 
> Zones inside an SMR HDD.
> 

Btw, the zoned device concept is not limited to HDDs. I'm not sure if
the patchset itself also targets SMR devices or is more focused on
zoned SSDs but, either way, the limitation that each zone can only be
written sequentially still applies.
Damien Le Moal Dec. 17, 2019, 12:05 a.m. UTC | #4
On 2019/12/16 17:19, Enrico Weigelt, metux IT consult wrote:
> On 12.12.19 19:38, Damien Le Moal wrote:
> 
> Hi,
> 
>> zonefs is a very simple file system exposing each zone of a zoned block
>> device as a file. Unlike a regular file system with zoned block device
>> support (e.g. f2fs or the ongoing btrfs effort), zonefs does not hide
>> the sequential write constraint of zoned block devices from the user.
> 
> Just curious: what's the exact definition of "zoned" here ?
> Something like partitions ?

As Carlos commented already, a zoned block device is a Linux
abstraction used to handle SMR HDDs (Shingled Magnetic Recording).
These disks expose an LBA range that is divided into zones that can
only be written sequentially for host-managed models. Other models such
as host-aware or drive-managed allow random writes to all zones at the
cost of potentially serious performance degradation due to disk
internal garbage collection of zones (similar to an SSD's handling of
erase blocks).

While today zoned block devices exist on the market only in the form of
SMR disks, NVMe SSDs will also soon be available with the completion of
the Zoned Namespace specifications.

Zoning of block devices has several advantages: higher capacities for
HDDs and more predictable and lower IO latencies for SSDs (almost no
internal GC/wear leveling needed). But taking full advantage of these
devices requires software changes on the host due to the sequential
write constraint imposed by the device interface.

> Can these files then also serve as block devices for other filesystems ?
> Just a funny idea: could we handle partitions by a file system ?
> 
> Even more funny idea: give file systems block device ops, so they can
> be directly used as such (w/o explicitly using loopdev) ;-)

This is outside the scope of this thread, so let's not start a
discussion about this here. Start a new thread !

>> Files representing sequential write zones of the device must be written
>> sequentially, starting from the end of the file (append-only writes).
> 
> So, these files can only be accessed like a tape ?

Writes must be sequential within a zone but reads can be random to any
written LBA.

> Assuming you're working on top of standard block devices anyways (instead
> of tape-like media ;-)) - why introduce such a limitation ?

See above: the limitation is physical, imposed by the device, so that
different improvements can be achieved depending on the storage medium
being used (increased capacity, lower latencies, lower
over-provisioning, etc.).

> 
>> zonefs is not a POSIX compliant file system. Its goal is to simplify
>> the implementation of zoned block device support in applications by
>> replacing raw block device file accesses with a richer file-based API,
>> avoiding reliance on direct block device file ioctls, which may
>> be more obscure to developers.
> 
> ioctls ?
> 
> Last time I checked, block devices could be easily accessed via plain
> file ops (read, write, seek, ...). You can basically treat them just
> like big files of fixed size.

I was not clear, my apologies. I am referring here to the zoned block
device related ioctls defined in include/uapi/linux/blkzoned.h. These
ioctls allow an application to manage the device zones (obtain zone
information, erase zones, etc.) by issuing zone related commands to the
device. These commands are defined by the ZBC and ZAC standards for
SCSI and ATA, and will be by NVMe Zoned Namespace in the very near
future.
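
For example, a zone report can be obtained with the BLKREPORTZONE
ioctl as sketched below (a minimal, hedged example: the device name is
illustrative and error handling is reduced to the bare minimum):

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/blkzoned.h>

    int main(void)
    {
        struct blk_zone_report *rep;
        unsigned int i, nr = 8;
        int fd;

        /* Room for the report header plus nr zone descriptors */
        rep = calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
        if (!rep)
            return 1;

        fd = open("/dev/sdX", O_RDONLY);  /* illustrative device name */
        if (fd < 0)
            return 1;

        rep->sector = 0;      /* report zones from the device start */
        rep->nr_zones = nr;   /* maximum number of zones to report */
        if (ioctl(fd, BLKREPORTZONE, rep) < 0)
            return 1;

        /* On return, nr_zones is the number of zones actually reported */
        for (i = 0; i < rep->nr_zones; i++)
            printf("zone %u: start %llu, len %llu, wp %llu\n", i,
                   (unsigned long long)rep->zones[i].start,
                   (unsigned long long)rep->zones[i].len,
                   (unsigned long long)rep->zones[i].wp);

        close(fd);
        free(rep);
        return 0;
    }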

>> One example of this approach is the
>> implementation of LSM (log-structured merge) tree structures (such as
>> those used in RocksDB and LevelDB)
> 
> The same LevelDB as used eg. in Chrome browser, which destroys itself
> every time a little temporary problem (eg. disk full) occours ?
> If that's the usecase I'd rather use an simple in-memory table instead
> and and enough swap, as leveldb isn't reliable enough for persistent
> data anyways :p

The intent of my comment was not to advocate for or discuss the merits
of any particular KV implementation. I was only pointing out that
zonefs does not come in a void: we do have use cases for it and did the
work on some user space software to validate it. LevelDB and RocksDB
are the two LSM-tree based KV stores we worked on, as they are very
popular and widely used.

>> on zoned block devices by allowing SSTables
>> to be stored in a zone file similarly to a regular file system rather
>> than as a range of sectors of a zoned device. The introduction of the
>> higher level construct "one file is one zone" can help reduce the
>> amount of changes needed in the application while at the same time
>> allowing the use of zoned block devices with various programming
>> languages other than C.
> 
> Why not just simply use files on a suitable filesystem (w/ low block io
> overhead) or LVM volumes ?

Using a file system compliant with zoned block device constraints such
as f2fs or btrfs (ongoing work) is certainly a valid approach. However,
this may not be the most optimal one if the application being used has
a mostly sequential write behavior. LSM-tree based KV stores fall into
this category: SSTables are large (several MB) and always written
sequentially. There are no random writes, which facilitates supporting
zoned block devices directly, without the need for a file system that
would add a background GC process and degrade performance. As mentioned
in the cover letter, the zonefs goal is to facilitate the implementation
of this support compared to pure raw block device use.
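
As a hedged illustration of the "one file is one zone" idea (the mount
point and helper name below are made up, not from our RocksDB/LevelDB
work):

    #include <stdio.h>
    #include <unistd.h>

    /*
     * Recycle the zone file backing an obsolete SSTable: truncating a
     * zonefs sequential zone file to 0 resets the zone write pointer,
     * so a new table can then be appended from offset 0.
     */
    int sstable_recycle(unsigned int zone_no)
    {
        char path[64];

        /* zonefs exposes sequential zone files as <mnt>/seq/0, seq/1, ... */
        snprintf(path, sizeof(path), "/mnt/zonefs/seq/%u", zone_no);
        return truncate(path, 0);
    }

A new SSTable is then written to the recycled file with sequential
appends and made durable with fsync().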

> 
> 
> --mtx
>
Damien Le Moal Dec. 17, 2019, 12:26 a.m. UTC | #5
On 2019/12/16 18:42, Carlos Maiolino wrote:
> On Mon, Dec 16, 2019 at 10:36:00AM +0100, Carlos Maiolino wrote:
>> On Mon, Dec 16, 2019 at 09:18:23AM +0100, Enrico Weigelt, metux IT consult wrote:
>>> On 12.12.19 19:38, Damien Le Moal wrote:
>>>
>>> Hi,
>>>
>>>> zonefs is a very simple file system exposing each zone of a zoned block
>>>> device as a file. Unlike a regular file system with zoned block device
>>>> support (e.g. f2fs or the ongoing btrfs effort), zonefs does not hide
>>>> the sequential write constraint of zoned block devices from the user.
>>>
>>> Just curious: what's the exact definition of "zoned" here ?
>>> Something like partitions ?
>>
>> Zones inside an SMR HDD.
>>
> 
> Btw, the zoned device concept is not limited to HDDs. I'm not sure if
> the patchset itself also targets SMR devices or is more focused on
> zoned SSDs but, either way, the limitation that each zone can only be
> written sequentially still applies.

zonefs supports any block device that advertises itself as "zoned"
(blk_queue_is_zoned(q) is true) through the zoned block device
abstraction (block/blk-zoned.c). This includes all SMR HDDs (both SCSI
and ATA), null_blk devices with zoned mode enabled and dm-linear
devices built on top of zoned devices.
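
As a side note, whether a disk is zoned, and its model, can be checked
from user space through sysfs, e.g. with something like this minimal
sketch (the disk name is illustrative):

    #include <stdio.h>

    int main(void)
    {
        char model[32];
        FILE *f;

        /* "sdX" is illustrative; substitute the actual disk name */
        f = fopen("/sys/block/sdX/queue/zoned", "r");
        if (!f)
            return 1;
        /* Reports "none", "host-aware" or "host-managed" */
        if (fgets(model, sizeof(model), f))
            printf("zoned model: %s", model);
        fclose(f);
        return 0;
    }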

On the SSD front, the NVMe Zoned Namespace standard is still a draft
being worked on by the NVMe committee and no devices are available on
the market yet.
Enrico Weigelt, metux IT consult Dec. 17, 2019, 12:33 p.m. UTC | #6
On 16.12.19 10:35, Carlos Maiolino wrote:

Hi,

>> Just curious: what's the exact definition of "zoned" here ?
>> Something like partitions ?
> 
> Zones inside an SMR HDD.

Oh, I wasn't aware that those things are exposed to the host at all.
Are you dealing with host-managed SMR HDDs ?

> On an SMR HDD, each zone can only be written sequentially, due to
> physical constraints. I won't post any links with references because I
> think majordomo will spam my email if I do, but do a Google search for
> something like 'SMR HDD zones' and you'll get a better idea.

Reminds me of classic CD-Rs or tapes. Why not deal with them similarly ?


--mtx

---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287
Enrico Weigelt, metux IT consult Dec. 17, 2019, 1:05 p.m. UTC | #7
On 17.12.19 01:26, Damien Le Moal wrote:

Hi,

> On the SSD front, the NVMe Zoned Namespace standard is still a draft
> being worked on by the NVMe committee and no devices are available on
> the market yet.

anybody here who can tell why this could be useful ?

Can erase blocks be made so enormously huge, and is there really a
huge gain in doing so that makes any practical difference ?

Oh, BTW, since the write semantics seem so similar, why not treat
them similarly to raw flash ?


--mtx
Damien Le Moal Dec. 18, 2019, 12:57 a.m. UTC | #8
On 2019/12/17 21:34, Enrico Weigelt, metux IT consult wrote:
> On 16.12.19 10:35, Carlos Maiolino wrote:
> 
> Hi,
> 
>>> Just curious: what's the exact definition of "zoned" here ?
>>> Something like partitions ?
>>
>> Zones inside an SMR HDD.
> 
> Oh, I wasn't aware that those things are exposed to the host at all.
> Are you dealing with host-managed SMR HDDs ?

Yes. The host-managed models of SMR drives have become the de-facto
standard for enterprise applications because of their more predictable
performance compared to host-aware models.

Many USB external disks these days also use SMR, but drive-managed
models. These are regular block devices from the interface point of
view: the host does not and cannot see the "zones" of the disk. SMR
constraints are hidden by the device firmware.

> 
>> On an SMR HDD, each zone can only be written sequentially, due to
>> physical constraints. I won't post any links with references because I
>> think majordomo will spam my email if I do, but do a Google search for
>> something like 'SMR HDD zones' and you'll get a better idea.
> 
> Reminds me of classic CD-Rs or tapes. Why not deal with them similarly ?

Because of the performance difference. Excluding any software/use
difference (i.e. GC overhead if needed), from a purely IO perspective,
SMR host-managed disks are as fast as regular disks and can handle
multiple streams simultaneously at high queue depth for better
throughput (think video surveillance applications or video streaming).
That is not the case for CDs or tapes.

The performance difference with CDs and tapes, leading to different
possible workloads and usage patterns, is even more pronounced with
SSDs. In the end, only the write pattern looks similar to CDs and
tapes. Everything else is the same as for a regular block device.

> 
> 
> --mtx
> 
> ---
> Enrico Weigelt, metux IT consult
> Free software and Linux embedded engineering
> info@metux.net -- +49-151-27565287
>
Damien Le Moal Dec. 18, 2019, 1:09 a.m. UTC | #9
On 2019/12/17 22:05, Enrico Weigelt, metux IT consult wrote:
> On 17.12.19 01:26, Damien Le Moal wrote:
> 
> Hi,
> 
>> On the SSD front, the NVMe Zoned Namespace standard is still a draft
>> being worked on by the NVMe committee and no devices are available on
>> the market yet.
> 
> anybody here who can tell why this could be useful ?

To reduce device costs thanks to less flash over-provisioning needed
(leading to higher usable capacities), simpler device firmware FTL
(leading to lower DRAM needs, so lower power and less heat) and higher
predictability of IO latencies.

Yes, there is the sequential write constraint (that's the "no free
lunch" part of the picture), but many workloads can accommodate this
constraint (any video streaming application, sensor logging, etc...)

> Can erase blocks be made so enormously huge, and is there really a
> huge gain in doing so that makes any practical difference ?

Making the erase block enormous would likely lead to enormous zone
sizes, which is generally not desired as that becomes very costly if the
application/user needs to do GC on the zones. A balance is generally
reached here between HW media needs and usability.

> Oh, BTW, since the write semantics seem so similar, why not treat
> them similarly to raw flash ?

This is the OpenChannel SSD model. This exists and is supported by Linux
(lightnvm). This model is however more complex due to the plethora of
parameters that the host can/needs to control. The zone model is much
simpler and its application to NVMe with Zoned Namespace fits very well
into the block IO stack work that was done for SMR since kernel 4.10.

Another reason for choosing ZNS over OCSSD is that device vendors can
actually give guarantees for the devices sold, as the device firmware
retains control over the flash cells' health management, which is much
less the case for OCSSD (the device health depends much more on what
the user is doing).

Best regards.