
[RFC,00/11] pcache: Persistent Memory Cache for Block Devices

Message ID: 20250414014505.20477-1-dongsheng.yang@linux.dev

Message

Dongsheng Yang April 14, 2025, 1:44 a.m. UTC
Hi All,

    This patchset introduces a new Linux block layer module called
**pcache**, which uses persistent memory (pmem) as a cache for block
devices.

Originally, this functionality was implemented as `cbd_cache` within
CBD (CXL Block Device). However, on further consideration it became
clear that the cache design is not tied to CBD's pmem device or
infrastructure; it is broadly applicable to **any** persistent memory
device that supports DAX. Therefore, I have split pcache out of CBD and
refactored it into a standalone module.

Although Intel's Optane product line has been discontinued, the Storage
Class Memory (SCM) field continues to evolve. For instance, Numemory
recently launched their Optane successor product, the NM101 SCM:
https://www.techpowerup.com/332914/numemory-releases-optane-successor-nm101-storage-class-memory

### About pcache

The table below compares pcache with two existing in-kernel caching
solutions, bcache and dm-writecache:

+-------------------------------+------------------------------+------------------------------+------------------------------+
| Feature                       | pcache                       | bcache                       | dm-writecache                |
+-------------------------------+------------------------------+------------------------------+------------------------------+
| pmem access method            | DAX                          | bio                          | DAX                          |
+-------------------------------+------------------------------+------------------------------+------------------------------+
| Write Latency (4K randwrite)  | ~7us                         | ~20us                        | ~7us                         |
+-------------------------------+------------------------------+------------------------------+------------------------------+
| Concurrency                   | Multi-tree per backend,      | Shared global index tree     | Single index tree and a      |
|                               | fully utilizing pmem         |                              | global wc_lock               |
+-------------------------------+------------------------------+------------------------------+------------------------------+
| IOPS (4K randwrite 32 numjobs)| 2107K                        | 352K                         | 283K                         |
+-------------------------------+------------------------------+------------------------------+------------------------------+
| Read Cache Support            | YES                          | YES                          | NO                           |
+-------------------------------+------------------------------+------------------------------+------------------------------+
| Deployment Flexibility        | No reformat needed           | Requires formatting backend  | Depends on dm framework,     |
|                               |                              | devices                      | less intuitive to deploy     |
+-------------------------------+------------------------------+------------------------------+------------------------------+
| Writeback Model               | Log-structured; preserves    | No guarantee between         | No writeback ordering        |
|                               | backing crash consistency;   | flush order and app IO order;| guarantee                    |
|                               | important for checkpoints    | may lose ordering in backing |                              |
+-------------------------------+------------------------------+------------------------------+------------------------------+
| Data Integrity                | CRC on both metadata and     | CRC on metadata only         | No CRC                       |
|                               | data (data CRC optional)     |                              |                              |
+-------------------------------+------------------------------+------------------------------+------------------------------+
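
The latency gap in the table tracks the pmem access method: pcache and
dm-writecache reach the pmem through DAX and copy data with CPU
instructions, while bcache goes through the bio path. Below is a
minimal sketch of what a DAX-style write path looks like (illustrative
only, not the actual pcache code; error handling, index/metadata
updates, and the final persistence barrier are omitted):

```c
#include <linux/dax.h>
#include <linux/string.h>
#include <linux/types.h>

/*
 * Illustrative only: copy one block of data into a DAX-mapped pmem
 * region (assumes len fits within the mapped page). A real cache would
 * additionally persist its index metadata and issue a write barrier
 * (e.g. pmem_wmb()) before completing the IO; the point here is simply
 * that no bio is submitted on the data path.
 */
static int pmem_write_block(struct dax_device *dax_dev, pgoff_t pgoff,
			    const void *src, size_t len)
{
	void *kaddr;
	long nr;

	/* Resolve the pmem page offset to a kernel virtual address. */
	nr = dax_direct_access(dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL);
	if (nr < 0)
		return nr;

	/* Non-temporal copy so the data does not linger in CPU caches. */
	memcpy_flushcache(kaddr, src, len);
	return 0;
}
```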

### Repository

- Kernel code: https://github.com/DataTravelGuide/linux/tree/pcache
- Userspace tool: https://github.com/DataTravelGuide/pcache-utils

### Example Usage

```bash
# Load the pcache module
$ insmod /workspace/linux_compile/drivers/block/pcache/pcache.ko
# Format and register /dev/pmem1 as the pcache cache device
$ pcache cache-start --path /dev/pmem1 --format --force
# Attach backing device /dev/vdh with a 50G cache and 16 queues;
# the new logical block device node is printed on success
$ pcache backing-start --path /dev/vdh --cache-size 50G --queues 16
/dev/pcache0
# List registered backing devices and their cache usage
$ pcache backing-list
[
    {
        "backing_id": 0,
        "backing_path": "/dev/vdh",
        "cache_segs": 3200,
        "cache_gc_percent": 70,
        "cache_used_segs": 2238,
        "logic_dev": "/dev/pcache0"
    }
]
```

Thanks for reviewing!

Dongsheng Yang (11):
  pcache: introduce cache_dev for managing persistent memory-based cache
    devices
  pcache: introduce segment abstraction
  pcache: introduce meta_segment abstraction
  pcache: introduce cache_segment abstraction
  pcache: introduce lifecycle management of pcache_cache
  pcache: gc and writeback
  pcache: introduce cache_key infrastructure for persistent metadata
    management
  pcache: implement request processing and cache I/O path in cache_req
  pcache: introduce logic block device and request handling
  pcache: add backing device management
  block: introduce pcache (persistent memory to be cache for block
    device)

 MAINTAINERS                            |   8 +
 drivers/block/Kconfig                  |   2 +
 drivers/block/Makefile                 |   2 +
 drivers/block/pcache/Kconfig           |  16 +
 drivers/block/pcache/Makefile          |   4 +
 drivers/block/pcache/backing_dev.c     | 593 +++++++++++++++++
 drivers/block/pcache/backing_dev.h     | 105 +++
 drivers/block/pcache/cache.c           | 394 +++++++++++
 drivers/block/pcache/cache.h           | 612 +++++++++++++++++
 drivers/block/pcache/cache_dev.c       | 808 ++++++++++++++++++++++
 drivers/block/pcache/cache_dev.h       |  81 +++
 drivers/block/pcache/cache_gc.c        | 150 +++++
 drivers/block/pcache/cache_key.c       | 885 +++++++++++++++++++++++++
 drivers/block/pcache/cache_req.c       | 812 +++++++++++++++++++++++
 drivers/block/pcache/cache_segment.c   | 247 +++++++
 drivers/block/pcache/cache_writeback.c | 183 +++++
 drivers/block/pcache/logic_dev.c       | 348 ++++++++++
 drivers/block/pcache/logic_dev.h       |  73 ++
 drivers/block/pcache/main.c            | 194 ++++++
 drivers/block/pcache/meta_segment.c    |  61 ++
 drivers/block/pcache/meta_segment.h    |  46 ++
 drivers/block/pcache/pcache_internal.h | 185 ++++++
 drivers/block/pcache/segment.c         | 175 +++++
 drivers/block/pcache/segment.h         |  78 +++
 24 files changed, 6062 insertions(+)
 create mode 100644 drivers/block/pcache/Kconfig
 create mode 100644 drivers/block/pcache/Makefile
 create mode 100644 drivers/block/pcache/backing_dev.c
 create mode 100644 drivers/block/pcache/backing_dev.h
 create mode 100644 drivers/block/pcache/cache.c
 create mode 100644 drivers/block/pcache/cache.h
 create mode 100644 drivers/block/pcache/cache_dev.c
 create mode 100644 drivers/block/pcache/cache_dev.h
 create mode 100644 drivers/block/pcache/cache_gc.c
 create mode 100644 drivers/block/pcache/cache_key.c
 create mode 100644 drivers/block/pcache/cache_req.c
 create mode 100644 drivers/block/pcache/cache_segment.c
 create mode 100644 drivers/block/pcache/cache_writeback.c
 create mode 100644 drivers/block/pcache/logic_dev.c
 create mode 100644 drivers/block/pcache/logic_dev.h
 create mode 100644 drivers/block/pcache/main.c
 create mode 100644 drivers/block/pcache/meta_segment.c
 create mode 100644 drivers/block/pcache/meta_segment.h
 create mode 100644 drivers/block/pcache/pcache_internal.h
 create mode 100644 drivers/block/pcache/segment.c
 create mode 100644 drivers/block/pcache/segment.h

Comments

Dan Williams April 15, 2025, 6 p.m. UTC | #1
Dongsheng Yang wrote:
> [...]

Thanks for making the comparison chart. The immediate question this
raises is why not add "multi-tree per backend", "log structured
writeback", "readcache", and "CRC" support to dm-writecache?
device-mapper is everywhere, has a long track record, and enhancing it
immediately engages a community of folks in this space.

Then reviewers can spend the time purely on the enhancements and not
reviewing a new block device-management stacking ABI.
Jens Axboe April 16, 2025, 1:04 a.m. UTC | #2
On 4/15/25 12:00 PM, Dan Williams wrote:
> Thanks for making the comparison chart. The immediate question this
> raises is why not add "multi-tree per backend", "log structured
> writeback", "readcache", and "CRC" support to dm-writecache?
> device-mapper is everywhere, has a long track record, and enhancing it
> immediately engages a community of folks in this space.

Strongly agree.
Dongsheng Yang April 16, 2025, 6:08 a.m. UTC | #3
On 2025/4/16 9:04, Jens Axboe wrote:
> On 4/15/25 12:00 PM, Dan Williams wrote:
>> Thanks for making the comparison chart. The immediate question this
>> raises is why not add "multi-tree per backend", "log structured
>> writeback", "readcache", and "CRC" support to dm-writecache?
>> device-mapper is everywhere, has a long track record, and enhancing it
>> immediately engages a community of folks in this space.
> Strongly agree.


Hi Dan and Jens,
Thanks for your reply, that's a good question.

     1. Why not optimize within dm-writecache?
     From my perspective, the design goal of dm-writecache is to be a 
minimal write cache. It achieves caching by dividing the cache device 
into n blocks, each managed by a wc_entry, using a very simple 
management mechanism. On top of this design, it's quite difficult to 
implement features like multi-tree structures, CRC, or log-structured 
writeback. Moreover, adding such optimizations—especially a read 
cache—would deviate from the original semantics of dm-writecache. So, we 
didn't consider optimizing dm-writecache to meet our goals.

     2. Why not optimize within bcache or dm-cache?
     As mentioned above, dm-writecache is essentially a minimal write 
cache. So, why not build on bcache or dm-cache, which are more complete 
caching systems? The truth is, it's also quite difficult. These systems 
were designed with traditional SSDs/NVMe in mind, and many of their 
design assumptions no longer hold true in the context of PMEM. Every 
design targets a specific scenario, which is why, even with dm-cache 
available, dm-writecache emerged to support DAX-capable PMEM devices.

     3. Then why not implement a full PMEM cache within the dm framework?
     In high-performance IO scenarios, especially with PMEM hardware,
an extra DM layer in the IO stack adds avoidable overhead. For example,
DM clones every bio before calling __map_bio(clone) to invoke the
target operation, which introduces per-request overhead.
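
As a rough illustration of where the dm target sits in that path (a
hedged sketch modeled loosely on dm-linear, not pcache or dm-writecache
code): even the simplest pass-through target only gains control in its
->map() callback after dm core has already allocated and cloned the
bio, and dm core submits the remapped clone once the callback returns.

```c
#include <linux/device-mapper.h>
#include <linux/bio.h>

/*
 * Illustrative pass-through ->map() callback, modeled loosely on
 * dm-linear (ctr/dtr callbacks and error handling omitted). By the
 * time this runs, dm core has already cloned the original bio; after
 * DM_MAPIO_REMAPPED is returned, dm core submits the clone.
 */
static int passthrough_map(struct dm_target *ti, struct bio *bio)
{
	/* assumed to have been set up in the target's ctr callback */
	struct dm_dev *dev = ti->private;

	bio_set_dev(bio, dev->bdev);
	bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);

	return DM_MAPIO_REMAPPED;
}
```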

Thank you again for the suggestion. I absolutely agree that leveraging
existing frameworks would be helpful in terms of code review and
merging. I, more than anyone, hope more people can help review the code
or join in this work. However, I believe that in the long run, building 
a standalone pcache module is a better choice.

Thanx
Dongsheng
