diff mbox series

[03/14] bcache: add initial data structures for nvm pages

Message ID 20210615054921.101421-4-colyli@suse.de (mailing list archive)
State New, archived
Headers show
Series bcache patches for Linux v5.14 | expand

Commit Message

Coly Li June 15, 2021, 5:49 a.m. UTC
This patch initializes the prototype data structures for the nvm pages
allocator,

- struct bch_nvm_pages_sb
This is the super block allocated on each nvdimm namespace. An nvdimm
set may have multiple namespaces; bch_nvm_pages_sb->set_uuid is used
to mark which nvdimm set this namespace belongs to. Normally we will
use the bcache cache set UUID to initialize this uuid, to connect this
nvdimm set to a specific bcache cache set.

- struct bch_owner_list_head
This is a table holding the heads of all owner lists. An owner list
records which page(s) are allocated to which owner. After a reboot from
power failure, the owner can find all its requested and allocated pages
from the owner list via a handle derived from its UUID.

- struct bch_nvm_pages_owner_head
This is the head of an owner list. Each owner has exactly one owner
list, and an nvm page belongs to exactly one owner. uuid[] will be set
to the owner's uuid; for bcache it is the bcache cache set uuid. label
is not mandatory, it is a human-readable string for debugging purposes.
The pointer array *recs references separate nvm pages which hold the
tables of struct bch_nvm_pgalloc_rec.

- struct bch_nvm_pgalloc_recs
This struct occupies a whole page. owner_uuid should match the uuid in
struct bch_nvm_pages_owner_head. recs[] is the real table containing
all allocation records.

- struct bch_nvm_pgalloc_rec
Each structure records a range of allocated nvm pages.
  - Bits  0 - 51: page offset of the allocated range.
  - Bits 52 - 57: allocated size as an order, i.e. (1 << order) pages.
  - Bits 58 - 63: reserved.
Since each allocated nvm range is a power-of-2 number of pages, 6 bits
of order give a maximum of (1 << 63) * PAGE_SIZE bytes, i.e. a range
size of up to 76 bits in bytes for a 4KB page size, which is large
enough for now.
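
As an illustration only (not part of this patch), a minimal user-space
sketch of how such a record can be packed and how the maximum range is
derived, assuming the bit layout above and a 4KB page size:

#include <stdint.h>
#include <stdio.h>

/* Mirrors the 64-bit record layout described above (illustrative only) */
struct pgalloc_rec {
	uint64_t pgoff:52;	/* page offset of the allocated range */
	uint64_t order:6;	/* allocated size is (1 << order) pages */
	uint64_t reserved:6;
};

int main(void)
{
	struct pgalloc_rec rec = { .pgoff = 1024, .order = 3 };
	/* (1 << 3) = 8 pages of 4KB each, starting at page 1024 */
	unsigned long long bytes = (1ULL << rec.order) * 4096ULL;

	printf("pgoff %llu pages, size %llu bytes\n",
	       (unsigned long long)rec.pgoff, bytes);
	/* With order up to 63 the largest range is 2^63 pages,
	 * i.e. 2^75 bytes (a 76-bit value) for 4KB pages. */
	return 0;
}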

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
---
 include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
 1 file changed, 200 insertions(+)
 create mode 100644 include/uapi/linux/bcache-nvm.h

Comments

Coly Li June 21, 2021, 4:17 p.m. UTC | #1
Hi all my dear receivers (Christoph, Dan, Hannes, Jan and Ying),

I need your help with code review for the following patch. This
series has been posted and refined across two merge windows already,
but we still lack code review from more experienced kernel developers
like you.

The following patch defines a set of on-NVDIMM memory objects,
which are used to support NVDIMM-backed journalling for bcache.
The current testing hardware is Intel AEP (Apache Pass).

Qiaowei Ren and Jianpeng Ma work with me on a mini page allocator
for NVDIMM pages; we allocate non-volatile memory pages from the
NVDIMM to store the bcache journal set. Journalling then becomes
very fast, and after a system reboot, once the NVDIMM mapping is
done, the bcache code can directly reference the journal set memory
objects without loading them via the block layer interface.

In order to restore allocated non-volatile memory, we use a set of
lists (named owner lists) to trace all allocated non-volatile memory
pages, identified by UUID. Just like the bcache journal set, the lists
are stored in NVDIMM and accessed directly as typical in-memory lists;
the only difference is that they are non-volatile: we access the lists
directly from NVDIMM and update them in place.

This is why pointers are defined in struct bch_nvm_pgalloc_recs: such
an object is referenced directly as a memory object and stored
directly on the NVDIMM.
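
To make that concrete, a rough sketch (the helpers below are
hypothetical, not the actual API of this series) of how an owner could
walk its owner list after reboot, once the namespace is DAX-mapped
again and the on-NVDIMM structures are visible in memory:

/*
 * Illustrative sketch only: recover previously allocated ranges for a
 * given owner UUID by walking the owner list stored on NVDIMM.  Assumes
 * the struct definitions from bcache-nvm.h and common kernel helpers.
 */
#include <linux/string.h>
#include <linux/printk.h>
#include <uapi/linux/bcache-nvm.h>

static struct bch_nvm_pages_owner_head *
find_owner_head(struct bch_owner_list_head *list_head,
		const unsigned char *uuid)
{
	unsigned int i;

	for (i = 0; i < list_head->used; i++)
		if (!memcmp(list_head->heads[i].uuid, uuid, 16))
			return &list_head->heads[i];
	return NULL;
}

static void walk_owner_recs(struct bch_nvm_pages_owner_head *owner, int ns)
{
	/* recs[ns] is a pointer stored on NVDIMM, dereferenced directly */
	struct bch_nvm_pgalloc_recs *recs = owner->recs[ns];

	while (recs) {
		unsigned int i;

		for (i = 0; i < recs->used; i++)
			pr_info("pgoff %llu, order %u\n",
				(unsigned long long)recs->recs[i].pgoff,
				(unsigned int)recs->recs[i].order);
		recs = recs->next;
	}
}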

The current patch series works as expected with a limited data set on
both bcache-tools and a patched kernel. Because the bcache btree nodes
are not stored on NVDIMM yet, journalling for leaf node splitting
will be handled in a later series.

The whole work of supporting NVDIMM for bcache involves,
- Storing the bcache journal on NVDIMM
- Storing bcache btree nodes on NVDIMM
- Storing cached data on NVDIMM
- On-NVDIMM object consistency across power failure

In order to make the code review easier, as the first step we submit
storing the journal on NVDIMM upstream; the following work will be
submitted step by step.

Jens wants wider review before taking this series into the bcache
upstream tree, and you are all experts I trust and respect.

I am asking all of you for help with code review, especially for the
following data structure definition patch, because I define pointers
in memory structures and reference and store them on the NVDIMM.

Thanks in advance for your help.

Coly Li




On 6/15/21 1:49 PM, Coly Li wrote:
> This patch initializes the prototype data structures for nvm pages
> allocator,
>
> - struct bch_nvm_pages_sb
> This is the super block allocated on each nvdimm namespace. A nvdimm
> set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used
> to mark which nvdimm set this name space belongs to. Normally we will
> use the bcache's cache set UUID to initialize this uuid, to connect this
> nvdimm set to a specified bcache cache set.
>
> - struct bch_owner_list_head
> This is a table for all heads of all owner lists. A owner list records
> which page(s) allocated to which owner. After reboot from power failure,
> the ownwer may find all its requested and allocated pages from the owner
> list by a handler which is converted by a UUID.
>
> - struct bch_nvm_pages_owner_head
> This is a head of an owner list. Each owner only has one owner list,
> and a nvm page only belongs to an specific owner. uuid[] will be set to
> owner's uuid, for bcache it is the bcache's cache set uuid. label is not
> mandatory, it is a human-readable string for debug purpose. The pointer
> *recs references to separated nvm page which hold the table of struct
> bch_nvm_pgalloc_rec.
>
> - struct bch_nvm_pgalloc_recs
> This struct occupies a whole page, owner_uuid should match the uuid
> in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
> allocated records.
>
> - struct bch_nvm_pgalloc_rec
> Each structure records a range of allocated nvm pages.
>   - Bits  0 - 51: is pages offset of the allocated pages.
>   - Bits 52 - 57: allocaed size in page_size * order-of-2
>   - Bits 58 - 63: reserved.
> Since each of the allocated nvm pages are power of 2, using 6 bits to
> represent allocated size can have (1<<(1<<64) - 1) * PAGE_SIZE maximum
> value. It can be a 76 bits width range size in byte for 4KB page size,
> which is large enough currently.
>
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
> ---
>  include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
>  1 file changed, 200 insertions(+)
>  create mode 100644 include/uapi/linux/bcache-nvm.h
>
> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
> new file mode 100644
> index 000000000000..5094a6797679
> --- /dev/null
> +++ b/include/uapi/linux/bcache-nvm.h
> @@ -0,0 +1,200 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_BCACHE_NVM_H
> +#define _UAPI_BCACHE_NVM_H
> +
> +#if (__BITS_PER_LONG == 64)
> +/*
> + * Bcache on NVDIMM data structures
> + */
> +
> +/*
> + * - struct bch_nvm_pages_sb
> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
> + * which nvdimm set this name space belongs to. Normally we will use the
> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
> + * set to a specified bcache cache set.
> + *
> + * - struct bch_owner_list_head
> + *   This is a table for all heads of all owner lists. A owner list records
> + * which page(s) allocated to which owner. After reboot from power failure,
> + * the ownwer may find all its requested and allocated pages from the owner
> + * list by a handler which is converted by a UUID.
> + *
> + * - struct bch_nvm_pages_owner_head
> + *   This is a head of an owner list. Each owner only has one owner list,
> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
> + * mandatory, it is a human-readable string for debug purpose. The pointer
> + * recs references to separated nvm page which hold the table of struct
> + * bch_pgalloc_rec.
> + *
> + *- struct bch_nvm_pgalloc_recs
> + *  This structure occupies a whole page, owner_uuid should match the uuid
> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
> + * allocated records.
> + *
> + * - struct bch_pgalloc_rec
> + *   Each structure records a range of allocated nvm pages. pgoff is offset
> + * in unit of page size of this allocated nvm page range. The adjoint page
> + * ranges of same owner can be merged into a larger one, therefore pages_nr
> + * is NOT always power of 2.
> + *
> + *
> + * Memory layout on nvdimm namespace 0
> + *
> + *    0 +---------------------------------+
> + *      |                                 |
> + *  4KB +---------------------------------+
> + *      |         bch_nvm_pages_sb        |
> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
> + *      |       bch_owner_list_head       |
> + *      |                                 |
> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
> + *      |       bch_nvm_pgalloc_recs      |
> + *      |  (nvm pages internal usage)     |
> + * 24KB +---------------------------------+
> + *      |                                 |
> + *      |                                 |
> + * 16MB  +---------------------------------+
> + *      |      allocable nvm pages        |
> + *      |      for buddy allocator        |
> + * end  +---------------------------------+
> + *
> + *
> + *
> + * Memory layout on nvdimm namespace N
> + * (doesn't have owner list)
> + *
> + *    0 +---------------------------------+
> + *      |                                 |
> + *  4KB +---------------------------------+
> + *      |         bch_nvm_pages_sb        |
> + *  8KB +---------------------------------+
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + * 16MB  +---------------------------------+
> + *      |      allocable nvm pages        |
> + *      |      for buddy allocator        |
> + * end  +---------------------------------+
> + *
> + */
> +
> +#include <linux/types.h>
> +
> +/* In sectors */
> +#define BCH_NVM_PAGES_SB_OFFSET			4096
> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
> +
> +#define BCH_NVM_PAGES_LABEL_SIZE		32
> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
> +
> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
> +
> +#define BCH_NVM_PAGES_SB_VERSION		0
> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
> +
> +static const unsigned char bch_nvm_pages_magic[] = {
> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
> +
> +/* takes 64bit width */
> +struct bch_pgalloc_rec {
> +	__u64	pgoff:52;
> +	__u64	order:6;
> +	__u64	reserved:6;
> +};
> +
> +struct bch_nvm_pgalloc_recs {
> +union {
> +	struct {
> +		struct bch_nvm_pages_owner_head	*owner;
> +		struct bch_nvm_pgalloc_recs	*next;
> +		unsigned char			magic[16];
> +		unsigned char			owner_uuid[16];
> +		unsigned int			size;
> +		unsigned int			used;
> +		unsigned long			_pad[4];
> +		struct bch_pgalloc_rec		recs[];
> +	};
> +	unsigned char				pad[8192];
> +};
> +};
> +
> +#define BCH_MAX_RECS					\
> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
> +	 sizeof(struct bch_pgalloc_rec))
> +
> +struct bch_nvm_pages_owner_head {
> +	unsigned char			uuid[16];
> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
> +	/* Per-namespace own lists */
> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
> +};
> +
> +/* heads[0] is always for nvm_pages internal usage */
> +struct bch_owner_list_head {
> +union {
> +	struct {
> +		unsigned int			size;
> +		unsigned int			used;
> +		unsigned long			_pad[4];
> +		struct bch_nvm_pages_owner_head	heads[];
> +	};
> +	unsigned char				pad[8192];
> +};
> +};
> +#define BCH_MAX_OWNER_LIST				\
> +	((sizeof(struct bch_owner_list_head) -		\
> +	 offsetof(struct bch_owner_list_head, heads)) /	\
> +	 sizeof(struct bch_nvm_pages_owner_head))
> +
> +/* The on-media bit order is local CPU order */
> +struct bch_nvm_pages_sb {
> +	unsigned long				csum;
> +	unsigned long				ns_start;
> +	unsigned long				sb_offset;
> +	unsigned long				version;
> +	unsigned char				magic[16];
> +	unsigned char				uuid[16];
> +	unsigned int				page_size;
> +	unsigned int				total_namespaces_nr;
> +	unsigned int				this_namespace_nr;
> +	union {
> +		unsigned char			set_uuid[16];
> +		unsigned long			set_magic;
> +	};
> +
> +	unsigned long				flags;
> +	unsigned long				seq;
> +
> +	unsigned long				feature_compat;
> +	unsigned long				feature_incompat;
> +	unsigned long				feature_ro_compat;
> +
> +	/* For allocable nvm pages from buddy systems */
> +	unsigned long				pages_offset;
> +	unsigned long				pages_total;
> +
> +	unsigned long				pad[8];
> +
> +	/* Only on the first name space */
> +	struct bch_owner_list_head		*owner_list_head;
> +
> +	/* Just for csum_set() */
> +	unsigned int				keys;
> +	unsigned long				d[0];
> +};
> +#endif /* __BITS_PER_LONG == 64 */
> +
> +#endif /* _UAPI_BCACHE_NVM_H */
Huang, Ying June 22, 2021, 8:41 a.m. UTC | #2
Coly Li <colyli@suse.de> writes:

> Hi all my dear receivers (Christoph, Dan, Hannes, Jan and Ying),
>
> I do need your help on code review for the following patch. This
> series are posted and refined for 2 merge windows already but we
> are lack of code review from more experienced kernel developers
> like you all.
>
> The following patch defines a set of on-NVDIMM memory objects,
> which are used to support NVDIMM for bcache journalling. Currently
> the testing hardware is Intel AEP (Apache Pass).
>
> Qiangwei Ren and Jianpeng Ma work with me to compose a mini pages
> allocator for NVDIMM pages, then we allocate non-volatiled memory
> pages from NVDIMM to store bcache journal set. Then the jouranling
> can be very fast and after system reboots, once the NVDIMM mapping
> is done, bcache code can directly reference journal set memory
> object without loading them via block layer interface.
>
> In order to restore allocated non-volatile memory, we use a set of
> list (named owner list) to trace all allocated non-volatile memory
> pages identified by UUID. Just like the bcache journal set, the list
> stored in NVDIMM and accessed directly as typical in-memory list,
> the only difference is they are non-volatiled: we access the lists
> directly from NVDIMM, update the list in-place.
>
> This is why you can see pointers are defined in struct
> bch_nvm_pgalloc_recs, because such object is reference directly as
> memory object, and stored directly onto NVDIMM.
>
> Current patch series works as expected with limited data-set on
> both bcache-tools and patched kernel. Because the bcache btree nodes
> are not stored onto NVDIMM yet, journaling for leaf node splitting
> will be handled in later series.
>
> The whole work of supporting NVDIMM for bcache will involve in,
> - Storing bcache journal on NVDIMM
> - Store bcache btree nodes on NVDIMM
> - Store cached data on NVDIMM.
> - On-NVDIMM objects consistency for power failure
>
> In order to make the code review to be more easier, the first step
> we submit storing journal on NVDIMM into upstream firstly, following
> work will be submitted step by step.
>
> Jens wants more widely review before taking this series into bcache
> upstream, and you are all the experts I trust and have my respect.
>
> I do ask for help of code review from you all. Especially for the
> following particular data structure definition patch, because I
> define pointers in memory structures and reference and store them on
> the NVDIMM.
>
> Thanks in advance for your help.
>
> Coly Li
>
>
>
>
> On 6/15/21 1:49 PM, Coly Li wrote:
>> This patch initializes the prototype data structures for nvm pages
>> allocator,
>>
>> - struct bch_nvm_pages_sb
>> This is the super block allocated on each nvdimm namespace. A nvdimm
>> set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used
>> to mark which nvdimm set this name space belongs to. Normally we will
>> use the bcache's cache set UUID to initialize this uuid, to connect this
>> nvdimm set to a specified bcache cache set.
>>
>> - struct bch_owner_list_head
>> This is a table for all heads of all owner lists. A owner list records
>> which page(s) allocated to which owner. After reboot from power failure,
>> the ownwer may find all its requested and allocated pages from the owner
>> list by a handler which is converted by a UUID.
>>
>> - struct bch_nvm_pages_owner_head
>> This is a head of an owner list. Each owner only has one owner list,
>> and a nvm page only belongs to an specific owner. uuid[] will be set to
>> owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>> mandatory, it is a human-readable string for debug purpose. The pointer
>> *recs references to separated nvm page which hold the table of struct
>> bch_nvm_pgalloc_rec.
>>
>> - struct bch_nvm_pgalloc_recs
>> This struct occupies a whole page, owner_uuid should match the uuid
>> in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>> allocated records.
>>
>> - struct bch_nvm_pgalloc_rec
>> Each structure records a range of allocated nvm pages.
>>   - Bits  0 - 51: is pages offset of the allocated pages.
>>   - Bits 52 - 57: allocaed size in page_size * order-of-2
>>   - Bits 58 - 63: reserved.
>> Since each of the allocated nvm pages are power of 2, using 6 bits to
>> represent allocated size can have (1<<(1<<64) - 1) * PAGE_SIZE maximum
>> value. It can be a 76 bits width range size in byte for 4KB page size,
>> which is large enough currently.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>> ---
>>  include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
>>  1 file changed, 200 insertions(+)
>>  create mode 100644 include/uapi/linux/bcache-nvm.h
>>
>> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
>> new file mode 100644
>> index 000000000000..5094a6797679
>> --- /dev/null
>> +++ b/include/uapi/linux/bcache-nvm.h
>> @@ -0,0 +1,200 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +
>> +#ifndef _UAPI_BCACHE_NVM_H
>> +#define _UAPI_BCACHE_NVM_H
>> +
>> +#if (__BITS_PER_LONG == 64)
>> +/*
>> + * Bcache on NVDIMM data structures
>> + */
>> +
>> +/*
>> + * - struct bch_nvm_pages_sb
>> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
>> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
>> + * which nvdimm set this name space belongs to. Normally we will use the
>> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
>> + * set to a specified bcache cache set.
>> + *
>> + * - struct bch_owner_list_head
>> + *   This is a table for all heads of all owner lists. A owner list records
>> + * which page(s) allocated to which owner. After reboot from power failure,
>> + * the ownwer may find all its requested and allocated pages from the owner
>> + * list by a handler which is converted by a UUID.
>> + *
>> + * - struct bch_nvm_pages_owner_head
>> + *   This is a head of an owner list. Each owner only has one owner list,
>> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
>> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>> + * mandatory, it is a human-readable string for debug purpose. The pointer
>> + * recs references to separated nvm page which hold the table of struct
>> + * bch_pgalloc_rec.
>> + *
>> + *- struct bch_nvm_pgalloc_recs
>> + *  This structure occupies a whole page, owner_uuid should match the uuid
>> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>> + * allocated records.
>> + *
>> + * - struct bch_pgalloc_rec
>> + *   Each structure records a range of allocated nvm pages. pgoff is offset
>> + * in unit of page size of this allocated nvm page range. The adjoint page
>> + * ranges of same owner can be merged into a larger one, therefore pages_nr
>> + * is NOT always power of 2.
>> + *
>> + *
>> + * Memory layout on nvdimm namespace 0
>> + *
>> + *    0 +---------------------------------+
>> + *      |                                 |
>> + *  4KB +---------------------------------+
>> + *      |         bch_nvm_pages_sb        |
>> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
>> + *      |       bch_owner_list_head       |
>> + *      |                                 |
>> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
>> + *      |       bch_nvm_pgalloc_recs      |
>> + *      |  (nvm pages internal usage)     |
>> + * 24KB +---------------------------------+
>> + *      |                                 |
>> + *      |                                 |
>> + * 16MB  +---------------------------------+
>> + *      |      allocable nvm pages        |
>> + *      |      for buddy allocator        |
>> + * end  +---------------------------------+
>> + *
>> + *
>> + *
>> + * Memory layout on nvdimm namespace N
>> + * (doesn't have owner list)
>> + *
>> + *    0 +---------------------------------+
>> + *      |                                 |
>> + *  4KB +---------------------------------+
>> + *      |         bch_nvm_pages_sb        |
>> + *  8KB +---------------------------------+
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + * 16MB  +---------------------------------+
>> + *      |      allocable nvm pages        |
>> + *      |      for buddy allocator        |
>> + * end  +---------------------------------+
>> + *
>> + */
>> +
>> +#include <linux/types.h>
>> +
>> +/* In sectors */
>> +#define BCH_NVM_PAGES_SB_OFFSET			4096
>> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
>> +
>> +#define BCH_NVM_PAGES_LABEL_SIZE		32
>> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
>> +
>> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
>> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
>> +
>> +#define BCH_NVM_PAGES_SB_VERSION		0
>> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
>> +
>> +static const unsigned char bch_nvm_pages_magic[] = {
>> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
>> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
>> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
>> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
>> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
>> +
>> +/* takes 64bit width */
>> +struct bch_pgalloc_rec {
>> +	__u64	pgoff:52;
>> +	__u64	order:6;
>> +	__u64	reserved:6;
>> +};
>> +
>> +struct bch_nvm_pgalloc_recs {
>> +union {
>> +	struct {
>> +		struct bch_nvm_pages_owner_head	*owner;
>> +		struct bch_nvm_pgalloc_recs	*next;

I have concerns about using pointers directly in the on-NVDIMM data
structures too.  How can you guarantee the NVDIMM devices will be mapped
to the exact same virtual address across reboots?

Best Regards,
Huang, Ying

>> +		unsigned char			magic[16];
>> +		unsigned char			owner_uuid[16];
>> +		unsigned int			size;
>> +		unsigned int			used;
>> +		unsigned long			_pad[4];
>> +		struct bch_pgalloc_rec		recs[];
>> +	};
>> +	unsigned char				pad[8192];
>> +};
>> +};
>> +
>> +#define BCH_MAX_RECS					\
>> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
>> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
>> +	 sizeof(struct bch_pgalloc_rec))
>> +
>> +struct bch_nvm_pages_owner_head {
>> +	unsigned char			uuid[16];
>> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
>> +	/* Per-namespace own lists */
>> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
>> +};
>> +
>> +/* heads[0] is always for nvm_pages internal usage */
>> +struct bch_owner_list_head {
>> +union {
>> +	struct {
>> +		unsigned int			size;
>> +		unsigned int			used;
>> +		unsigned long			_pad[4];
>> +		struct bch_nvm_pages_owner_head	heads[];
>> +	};
>> +	unsigned char				pad[8192];
>> +};
>> +};
>> +#define BCH_MAX_OWNER_LIST				\
>> +	((sizeof(struct bch_owner_list_head) -		\
>> +	 offsetof(struct bch_owner_list_head, heads)) /	\
>> +	 sizeof(struct bch_nvm_pages_owner_head))
>> +
>> +/* The on-media bit order is local CPU order */
>> +struct bch_nvm_pages_sb {
>> +	unsigned long				csum;
>> +	unsigned long				ns_start;
>> +	unsigned long				sb_offset;
>> +	unsigned long				version;
>> +	unsigned char				magic[16];
>> +	unsigned char				uuid[16];
>> +	unsigned int				page_size;
>> +	unsigned int				total_namespaces_nr;
>> +	unsigned int				this_namespace_nr;
>> +	union {
>> +		unsigned char			set_uuid[16];
>> +		unsigned long			set_magic;
>> +	};
>> +
>> +	unsigned long				flags;
>> +	unsigned long				seq;
>> +
>> +	unsigned long				feature_compat;
>> +	unsigned long				feature_incompat;
>> +	unsigned long				feature_ro_compat;
>> +
>> +	/* For allocable nvm pages from buddy systems */
>> +	unsigned long				pages_offset;
>> +	unsigned long				pages_total;
>> +
>> +	unsigned long				pad[8];
>> +
>> +	/* Only on the first name space */
>> +	struct bch_owner_list_head		*owner_list_head;
>> +
>> +	/* Just for csum_set() */
>> +	unsigned int				keys;
>> +	unsigned long				d[0];
>> +};
>> +#endif /* __BITS_PER_LONG == 64 */
>> +
>> +#endif /* _UAPI_BCACHE_NVM_H */
Hannes Reinecke June 22, 2021, 10:19 a.m. UTC | #3
On 6/15/21 7:49 AM, Coly Li wrote:
> This patch initializes the prototype data structures for nvm pages
> allocator,
> 
> - struct bch_nvm_pages_sb
> This is the super block allocated on each nvdimm namespace. A nvdimm
> set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used
> to mark which nvdimm set this name space belongs to. Normally we will
> use the bcache's cache set UUID to initialize this uuid, to connect this
> nvdimm set to a specified bcache cache set.
> 
> - struct bch_owner_list_head
> This is a table for all heads of all owner lists. A owner list records
> which page(s) allocated to which owner. After reboot from power failure,
> the ownwer may find all its requested and allocated pages from the owner

owner

> list by a handler which is converted by a UUID.
> 
> - struct bch_nvm_pages_owner_head
> This is a head of an owner list. Each owner only has one owner list,
> and a nvm page only belongs to an specific owner. uuid[] will be set to
> owner's uuid, for bcache it is the bcache's cache set uuid. label is not
> mandatory, it is a human-readable string for debug purpose. The pointer
> *recs references to separated nvm page which hold the table of struct
> bch_nvm_pgalloc_rec.
> 
> - struct bch_nvm_pgalloc_recs
> This struct occupies a whole page, owner_uuid should match the uuid
> in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
> allocated records.
> 
> - struct bch_nvm_pgalloc_rec
> Each structure records a range of allocated nvm pages.
>   - Bits  0 - 51: is pages offset of the allocated pages.
>   - Bits 52 - 57: allocaed size in page_size * order-of-2
>   - Bits 58 - 63: reserved.
> Since each of the allocated nvm pages are power of 2, using 6 bits to
> represent allocated size can have (1<<(1<<64) - 1) * PAGE_SIZE maximum
> value. It can be a 76 bits width range size in byte for 4KB page size,
> which is large enough currently.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
> ---
>  include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
>  1 file changed, 200 insertions(+)
>  create mode 100644 include/uapi/linux/bcache-nvm.h
> 
> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
> new file mode 100644
> index 000000000000..5094a6797679
> --- /dev/null
> +++ b/include/uapi/linux/bcache-nvm.h
> @@ -0,0 +1,200 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +
> +#ifndef _UAPI_BCACHE_NVM_H
> +#define _UAPI_BCACHE_NVM_H
> +
> +#if (__BITS_PER_LONG == 64)
> +/*
> + * Bcache on NVDIMM data structures
> + */
> +
> +/*
> + * - struct bch_nvm_pages_sb
> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
> + * which nvdimm set this name space belongs to. Normally we will use the
> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
> + * set to a specified bcache cache set.
> + *
> + * - struct bch_owner_list_head
> + *   This is a table for all heads of all owner lists. A owner list records
> + * which page(s) allocated to which owner. After reboot from power failure,
> + * the ownwer may find all its requested and allocated pages from the owner
> + * list by a handler which is converted by a UUID.
> + *
> + * - struct bch_nvm_pages_owner_head
> + *   This is a head of an owner list. Each owner only has one owner list,
> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
> + * mandatory, it is a human-readable string for debug purpose. The pointer
> + * recs references to separated nvm page which hold the table of struct
> + * bch_pgalloc_rec.
> + *
> + *- struct bch_nvm_pgalloc_recs
> + *  This structure occupies a whole page, owner_uuid should match the uuid
> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
> + * allocated records.
> + *
> + * - struct bch_pgalloc_rec
> + *   Each structure records a range of allocated nvm pages. pgoff is offset
> + * in unit of page size of this allocated nvm page range. The adjoint page
> + * ranges of same owner can be merged into a larger one, therefore pages_nr
> + * is NOT always power of 2.
> + *
> + *
> + * Memory layout on nvdimm namespace 0
> + *
> + *    0 +---------------------------------+
> + *      |                                 |
> + *  4KB +---------------------------------+
> + *      |         bch_nvm_pages_sb        |
> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
> + *      |       bch_owner_list_head       |
> + *      |                                 |
> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
> + *      |       bch_nvm_pgalloc_recs      |
> + *      |  (nvm pages internal usage)     |
> + * 24KB +---------------------------------+
> + *      |                                 |
> + *      |                                 |
> + * 16MB  +---------------------------------+
> + *      |      allocable nvm pages        |
> + *      |      for buddy allocator        |
> + * end  +---------------------------------+
> + *
> + *
> + *
> + * Memory layout on nvdimm namespace N
> + * (doesn't have owner list)
> + *
> + *    0 +---------------------------------+
> + *      |                                 |
> + *  4KB +---------------------------------+
> + *      |         bch_nvm_pages_sb        |
> + *  8KB +---------------------------------+
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + *      |                                 |
> + * 16MB  +---------------------------------+
> + *      |      allocable nvm pages        |
> + *      |      for buddy allocator        |
> + * end  +---------------------------------+
> + *
> + */
> +
> +#include <linux/types.h>
> +
> +/* In sectors */
> +#define BCH_NVM_PAGES_SB_OFFSET			4096
> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
> +
> +#define BCH_NVM_PAGES_LABEL_SIZE		32
> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
> +
> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
> +
> +#define BCH_NVM_PAGES_SB_VERSION		0
> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
> +
> +static const unsigned char bch_nvm_pages_magic[] = {
> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
> +
> +/* takes 64bit width */
> +struct bch_pgalloc_rec {
> +	__u64	pgoff:52;
> +	__u64	order:6;
> +	__u64	reserved:6;
> +};
> +
> +struct bch_nvm_pgalloc_recs {
> +union {

Indentation.

> +	struct {
> +		struct bch_nvm_pages_owner_head	*owner;
> +		struct bch_nvm_pgalloc_recs	*next;
> +		unsigned char			magic[16];
> +		unsigned char			owner_uuid[16];
> +		unsigned int			size;
> +		unsigned int			used;
> +		unsigned long			_pad[4];
> +		struct bch_pgalloc_rec		recs[];
> +	};
> +	unsigned char				pad[8192];
> +};
> +};
> +

Consider using __u64 and friends when specifying a structure with a
fixed alignment; that also removes the need of the BITS_PER_LONG ifdef
at the top.
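
For example (a sketch of the suggestion only, not the layout the series
ends up with), the superblock could start out as:

/* Sketch: same superblock with fixed-width types from <linux/types.h>,
 * so the layout is identical on 32-bit and 64-bit builds. */
struct bch_nvm_pages_sb {
	__u64		csum;
	__u64		ns_start;
	__u64		sb_offset;
	__u64		version;
	__u8		magic[16];
	__u8		uuid[16];
	__u32		page_size;
	__u32		total_namespaces_nr;
	__u32		this_namespace_nr;
	union {
		__u8	set_uuid[16];
		__u64	set_magic;
	};
	__u64		flags;
	__u64		seq;
	/* ... remaining fields converted the same way, with the
	 * owner_list_head pointer stored as a __u64 value/offset ... */
};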

> +#define BCH_MAX_RECS					\
> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
> +	 sizeof(struct bch_pgalloc_rec))
> +

What _are_ you doing here?
You're not seriously using the 'pad' field as a placeholder to size the
structure accordingly?
Also, what is the size of the 'bch_nvm_pgalloc_recs' structure?
8k + header size?
That is very awkward, as the page allocator won't be able to handle it
efficiently.
Please size it to either 8k or 16k overall.
And if you do that you can simplify this define.
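
One way to pin the overall size down and document the intent, whatever
layout is chosen, would be a compile-time assertion along these lines
(illustrative only, using C11 static_assert or the kernel's
BUILD_BUG_ON):

/* Sketch: assert at compile time that the on-NVDIMM containers are
 * exactly 8KB, so BCH_MAX_RECS / BCH_MAX_OWNER_LIST stay meaningful. */
static_assert(sizeof(struct bch_nvm_pgalloc_recs) == 8192,
	      "bch_nvm_pgalloc_recs must be exactly 8KB");
static_assert(sizeof(struct bch_owner_list_head) == 8192,
	      "bch_owner_list_head must be exactly 8KB");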

> +struct bch_nvm_pages_owner_head {
> +	unsigned char			uuid[16];
> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
> +	/* Per-namespace own lists */
> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
> +};
> +
> +/* heads[0] is always for nvm_pages internal usage */
> +struct bch_owner_list_head {
> +union {
> +	struct {
> +		unsigned int			size;
> +		unsigned int			used;
> +		unsigned long			_pad[4];
> +		struct bch_nvm_pages_owner_head	heads[];
> +	};
> +	unsigned char				pad[8192];
> +};
> +};
> +#define BCH_MAX_OWNER_LIST				\
> +	((sizeof(struct bch_owner_list_head) -		\
> +	 offsetof(struct bch_owner_list_head, heads)) /	\
> +	 sizeof(struct bch_nvm_pages_owner_head))
> +

Same here.
Please size it so that the 'bch_owner_list_head' structure fits into
either 8k or 16k.

> +/* The on-media bit order is local CPU order */
> +struct bch_nvm_pages_sb {
> +	unsigned long				csum;
> +	unsigned long				ns_start;
> +	unsigned long				sb_offset;
> +	unsigned long				version;
> +	unsigned char				magic[16];
> +	unsigned char				uuid[16];
> +	unsigned int				page_size;
> +	unsigned int				total_namespaces_nr;
> +	unsigned int				this_namespace_nr;
> +	union {
> +		unsigned char			set_uuid[16];
> +		unsigned long			set_magic;
> +	};
> +
> +	unsigned long				flags;
> +	unsigned long				seq;
> +
> +	unsigned long				feature_compat;
> +	unsigned long				feature_incompat;
> +	unsigned long				feature_ro_compat;
> +
> +	/* For allocable nvm pages from buddy systems */
> +	unsigned long				pages_offset;
> +	unsigned long				pages_total;
> +
> +	unsigned long				pad[8];
> +
> +	/* Only on the first name space */
> +	struct bch_owner_list_head		*owner_list_head;
> +
> +	/* Just for csum_set() */
> +	unsigned int				keys;
> +	unsigned long				d[0];
> +};

And also here, use __u64 and friends.

Cheers,

Hannes
Coly Li June 23, 2021, 4:32 a.m. UTC | #4
Hi Ying,

I am replying to your comment in-place, at the point where you commented.

On 6/22/21 4:41 PM, Huang, Ying wrote:
> Coly Li <colyli@suse.de> writes:
>
>> Hi all my dear receivers (Christoph, Dan, Hannes, Jan and Ying),
>>
>> I do need your help on code review for the following patch. This
>> series are posted and refined for 2 merge windows already but we
>> are lack of code review from more experienced kernel developers
>> like you all.
>>
>> The following patch defines a set of on-NVDIMM memory objects,
>> which are used to support NVDIMM for bcache journalling. Currently
>> the testing hardware is Intel AEP (Apache Pass).
>>
>> Qiangwei Ren and Jianpeng Ma work with me to compose a mini pages
>> allocator for NVDIMM pages, then we allocate non-volatiled memory
>> pages from NVDIMM to store bcache journal set. Then the jouranling
>> can be very fast and after system reboots, once the NVDIMM mapping
>> is done, bcache code can directly reference journal set memory
>> object without loading them via block layer interface.
>>
>> In order to restore allocated non-volatile memory, we use a set of
>> list (named owner list) to trace all allocated non-volatile memory
>> pages identified by UUID. Just like the bcache journal set, the list
>> stored in NVDIMM and accessed directly as typical in-memory list,
>> the only difference is they are non-volatiled: we access the lists
>> directly from NVDIMM, update the list in-place.
>>
>> This is why you can see pointers are defined in struct
>> bch_nvm_pgalloc_recs, because such object is reference directly as
>> memory object, and stored directly onto NVDIMM.
>>
>> Current patch series works as expected with limited data-set on
>> both bcache-tools and patched kernel. Because the bcache btree nodes
>> are not stored onto NVDIMM yet, journaling for leaf node splitting
>> will be handled in later series.
>>
>> The whole work of supporting NVDIMM for bcache will involve in,
>> - Storing bcache journal on NVDIMM
>> - Store bcache btree nodes on NVDIMM
>> - Store cached data on NVDIMM.
>> - On-NVDIMM objects consistency for power failure
>>
>> In order to make the code review to be more easier, the first step
>> we submit storing journal on NVDIMM into upstream firstly, following
>> work will be submitted step by step.
>>
>> Jens wants more widely review before taking this series into bcache
>> upstream, and you are all the experts I trust and have my respect.
>>
>> I do ask for help of code review from you all. Especially for the
>> following particular data structure definition patch, because I
>> define pointers in memory structures and reference and store them on
>> the NVDIMM.
>>
>> Thanks in advance for your help.
>>
>> Coly Li
>>
>>
>>
>>
>> On 6/15/21 1:49 PM, Coly Li wrote:
>>> This patch initializes the prototype data structures for nvm pages
>>> allocator,
>>>
>>> - struct bch_nvm_pages_sb
>>> This is the super block allocated on each nvdimm namespace. A nvdimm
>>> set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used
>>> to mark which nvdimm set this name space belongs to. Normally we will
>>> use the bcache's cache set UUID to initialize this uuid, to connect this
>>> nvdimm set to a specified bcache cache set.
>>>
>>> - struct bch_owner_list_head
>>> This is a table for all heads of all owner lists. A owner list records
>>> which page(s) allocated to which owner. After reboot from power failure,
>>> the ownwer may find all its requested and allocated pages from the owner
>>> list by a handler which is converted by a UUID.
>>>
>>> - struct bch_nvm_pages_owner_head
>>> This is a head of an owner list. Each owner only has one owner list,
>>> and a nvm page only belongs to an specific owner. uuid[] will be set to
>>> owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>>> mandatory, it is a human-readable string for debug purpose. The pointer
>>> *recs references to separated nvm page which hold the table of struct
>>> bch_nvm_pgalloc_rec.
>>>
>>> - struct bch_nvm_pgalloc_recs
>>> This struct occupies a whole page, owner_uuid should match the uuid
>>> in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>>> allocated records.
>>>
>>> - struct bch_nvm_pgalloc_rec
>>> Each structure records a range of allocated nvm pages.
>>>   - Bits  0 - 51: is pages offset of the allocated pages.
>>>   - Bits 52 - 57: allocaed size in page_size * order-of-2
>>>   - Bits 58 - 63: reserved.
>>> Since each of the allocated nvm pages are power of 2, using 6 bits to
>>> represent allocated size can have (1<<(1<<64) - 1) * PAGE_SIZE maximum
>>> value. It can be a 76 bits width range size in byte for 4KB page size,
>>> which is large enough currently.
>>>
>>> Signed-off-by: Coly Li <colyli@suse.de>
>>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>>> ---
>>>  include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
>>>  1 file changed, 200 insertions(+)
>>>  create mode 100644 include/uapi/linux/bcache-nvm.h
>>>
>>> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
>>> new file mode 100644
>>> index 000000000000..5094a6797679
>>> --- /dev/null
>>> +++ b/include/uapi/linux/bcache-nvm.h
>>> @@ -0,0 +1,200 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +
>>> +#ifndef _UAPI_BCACHE_NVM_H
>>> +#define _UAPI_BCACHE_NVM_H
>>> +
>>> +#if (__BITS_PER_LONG == 64)
>>> +/*
>>> + * Bcache on NVDIMM data structures
>>> + */
>>> +
>>> +/*
>>> + * - struct bch_nvm_pages_sb
>>> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
>>> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
>>> + * which nvdimm set this name space belongs to. Normally we will use the
>>> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
>>> + * set to a specified bcache cache set.
>>> + *
>>> + * - struct bch_owner_list_head
>>> + *   This is a table for all heads of all owner lists. A owner list records
>>> + * which page(s) allocated to which owner. After reboot from power failure,
>>> + * the ownwer may find all its requested and allocated pages from the owner
>>> + * list by a handler which is converted by a UUID.
>>> + *
>>> + * - struct bch_nvm_pages_owner_head
>>> + *   This is a head of an owner list. Each owner only has one owner list,
>>> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
>>> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>>> + * mandatory, it is a human-readable string for debug purpose. The pointer
>>> + * recs references to separated nvm page which hold the table of struct
>>> + * bch_pgalloc_rec.
>>> + *
>>> + *- struct bch_nvm_pgalloc_recs
>>> + *  This structure occupies a whole page, owner_uuid should match the uuid
>>> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>>> + * allocated records.
>>> + *
>>> + * - struct bch_pgalloc_rec
>>> + *   Each structure records a range of allocated nvm pages. pgoff is offset
>>> + * in unit of page size of this allocated nvm page range. The adjoint page
>>> + * ranges of same owner can be merged into a larger one, therefore pages_nr
>>> + * is NOT always power of 2.
>>> + *
>>> + *
>>> + * Memory layout on nvdimm namespace 0
>>> + *
>>> + *    0 +---------------------------------+
>>> + *      |                                 |
>>> + *  4KB +---------------------------------+
>>> + *      |         bch_nvm_pages_sb        |
>>> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
>>> + *      |       bch_owner_list_head       |
>>> + *      |                                 |
>>> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
>>> + *      |       bch_nvm_pgalloc_recs      |
>>> + *      |  (nvm pages internal usage)     |
>>> + * 24KB +---------------------------------+
>>> + *      |                                 |
>>> + *      |                                 |
>>> + * 16MB  +---------------------------------+
>>> + *      |      allocable nvm pages        |
>>> + *      |      for buddy allocator        |
>>> + * end  +---------------------------------+
>>> + *
>>> + *
>>> + *
>>> + * Memory layout on nvdimm namespace N
>>> + * (doesn't have owner list)
>>> + *
>>> + *    0 +---------------------------------+
>>> + *      |                                 |
>>> + *  4KB +---------------------------------+
>>> + *      |         bch_nvm_pages_sb        |
>>> + *  8KB +---------------------------------+
>>> + *      |                                 |
>>> + *      |                                 |
>>> + *      |                                 |
>>> + *      |                                 |
>>> + *      |                                 |
>>> + *      |                                 |
>>> + * 16MB  +---------------------------------+
>>> + *      |      allocable nvm pages        |
>>> + *      |      for buddy allocator        |
>>> + * end  +---------------------------------+
>>> + *
>>> + */
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +/* In sectors */
>>> +#define BCH_NVM_PAGES_SB_OFFSET			4096
>>> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
>>> +
>>> +#define BCH_NVM_PAGES_LABEL_SIZE		32
>>> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
>>> +
>>> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
>>> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
>>> +
>>> +#define BCH_NVM_PAGES_SB_VERSION		0
>>> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
>>> +
>>> +static const unsigned char bch_nvm_pages_magic[] = {
>>> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
>>> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
>>> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
>>> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
>>> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
>>> +
>>> +/* takes 64bit width */
>>> +struct bch_pgalloc_rec {
>>> +	__u64	pgoff:52;
>>> +	__u64	order:6;
>>> +	__u64	reserved:6;
>>> +};
>>> +
>>> +struct bch_nvm_pgalloc_recs {
>>> +union {
>>> +	struct {
>>> +		struct bch_nvm_pages_owner_head	*owner;
>>> +		struct bch_nvm_pgalloc_recs	*next;
> I have concerns about using pointers directly in on-NVDIMM data
> structure too.  How can you guarantee the NVDIMM devices will be mapped
> to exact same virtual address across reboot?
>
> Best Regards,
> Huang, Ying


We use the NVDIMM namespace as memory, and from our testing and
observation, the DAX mapping base address is consistent as long as the
NVDIMM address in the e820 table does not change.

Also from our testing and observation, the NVDIMM address in the e820
table does not change when,
- the NVDIMM and DRAM memory population does not change
- more NVDIMM and/or DRAM is installed on top of the existing population
- the NVDIMM stays plugged in the same slot, with no movement or swap
- no CPU is removed or changed

For 99.9%+ of the time, when the hardware is working healthily, the
above conditions can be assumed. Therefore we choose to store the whole
linear address (pointer) here, rather than a relative offset inside the
NVDIMM namespace.

For the remaining 0.0?% case where the NVDIMM address in the e820 table
changes: the DAX mapping address of the last run is stored in ns_start
of struct bch_nvm_pages_sb, so if the new DAX mapping address differs
from the ns_start value, all pointers in the owner list can be updated by,
    new_addr = (old_addr - old_ns_start) + new_ns_start

This update can be very fast (and it can be made power-failure tolerant
with careful coding). Therefore we decided to store full linear addresses
for direct memory access in the 99%+ case, and to update the pointers in
the 0.0?% case when the DAX mapping address of the NVDIMM changes.
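
A rough sketch of that fix-up (the helpers below are hypothetical; the
real handling is deferred as noted next):

/*
 * Illustrative only: rebase a pointer stored on NVDIMM when the DAX
 * mapping base has moved, using the formula above.
 */
static inline void *nvm_rebase(void *old_addr,
			       unsigned long old_ns_start,
			       unsigned long new_ns_start)
{
	if (!old_addr)
		return NULL;
	return (void *)((unsigned long)old_addr - old_ns_start + new_ns_start);
}

/* Example: fix up one owner head's per-namespace recs pointers
 * (each namespace would use its own old/new base pair). */
static void rebase_owner_head(struct bch_nvm_pages_owner_head *owner,
			      unsigned long old_ns_start,
			      unsigned long new_ns_start)
{
	int i;

	for (i = 0; i < BCH_NVM_PAGES_NAMESPACES_MAX; i++)
		owner->recs[i] = nvm_rebase(owner->recs[i],
					    old_ns_start, new_ns_start);
}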

Handling a DAX mapping address change is not a high-priority task at the
moment; our next tasks after this series is merged will be power failure
tolerance of the owner list (from the Intel developers) and storing
bcache btree nodes on NVDIMM pages (from me).

Thanks for your comments and review.

Coly Li


>>> +		unsigned char			magic[16];
>>> +		unsigned char			owner_uuid[16];
>>> +		unsigned int			size;
>>> +		unsigned int			used;
>>> +		unsigned long			_pad[4];
>>> +		struct bch_pgalloc_rec		recs[];
>>> +	};
>>> +	unsigned char				pad[8192];
>>> +};
>>> +};
>>> +
>>> +#define BCH_MAX_RECS					\
>>> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
>>> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
>>> +	 sizeof(struct bch_pgalloc_rec))
>>> +
>>> +struct bch_nvm_pages_owner_head {
>>> +	unsigned char			uuid[16];
>>> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
>>> +	/* Per-namespace own lists */
>>> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
>>> +};
>>> +
>>> +/* heads[0] is always for nvm_pages internal usage */
>>> +struct bch_owner_list_head {
>>> +union {
>>> +	struct {
>>> +		unsigned int			size;
>>> +		unsigned int			used;
>>> +		unsigned long			_pad[4];
>>> +		struct bch_nvm_pages_owner_head	heads[];
>>> +	};
>>> +	unsigned char				pad[8192];
>>> +};
>>> +};
>>> +#define BCH_MAX_OWNER_LIST				\
>>> +	((sizeof(struct bch_owner_list_head) -		\
>>> +	 offsetof(struct bch_owner_list_head, heads)) /	\
>>> +	 sizeof(struct bch_nvm_pages_owner_head))
>>> +
>>> +/* The on-media bit order is local CPU order */
>>> +struct bch_nvm_pages_sb {
>>> +	unsigned long				csum;
>>> +	unsigned long				ns_start;
>>> +	unsigned long				sb_offset;
>>> +	unsigned long				version;
>>> +	unsigned char				magic[16];
>>> +	unsigned char				uuid[16];
>>> +	unsigned int				page_size;
>>> +	unsigned int				total_namespaces_nr;
>>> +	unsigned int				this_namespace_nr;
>>> +	union {
>>> +		unsigned char			set_uuid[16];
>>> +		unsigned long			set_magic;
>>> +	};
>>> +
>>> +	unsigned long				flags;
>>> +	unsigned long				seq;
>>> +
>>> +	unsigned long				feature_compat;
>>> +	unsigned long				feature_incompat;
>>> +	unsigned long				feature_ro_compat;
>>> +
>>> +	/* For allocable nvm pages from buddy systems */
>>> +	unsigned long				pages_offset;
>>> +	unsigned long				pages_total;
>>> +
>>> +	unsigned long				pad[8];
>>> +
>>> +	/* Only on the first name space */
>>> +	struct bch_owner_list_head		*owner_list_head;
>>> +
>>> +	/* Just for csum_set() */
>>> +	unsigned int				keys;
>>> +	unsigned long				d[0];
>>> +};
>>> +#endif /* __BITS_PER_LONG == 64 */
>>> +
>>> +#endif /* _UAPI_BCACHE_NVM_H */
Huang, Ying June 23, 2021, 6:53 a.m. UTC | #5
Coly Li <colyli@suse.de> writes:

> Hi Ying,
>
> I reply your comment in-place where you commented on.
>
> On 6/22/21 4:41 PM, Huang, Ying wrote:
>> Coly Li <colyli@suse.de> writes:
>>

[snip]

>>>> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
>>>> new file mode 100644
>>>> index 000000000000..5094a6797679
>>>> --- /dev/null
>>>> +++ b/include/uapi/linux/bcache-nvm.h
>>>> @@ -0,0 +1,200 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>>> +
>>>> +#ifndef _UAPI_BCACHE_NVM_H
>>>> +#define _UAPI_BCACHE_NVM_H
>>>> +
>>>> +#if (__BITS_PER_LONG == 64)
>>>> +/*
>>>> + * Bcache on NVDIMM data structures
>>>> + */
>>>> +
>>>> +/*
>>>> + * - struct bch_nvm_pages_sb
>>>> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
>>>> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
>>>> + * which nvdimm set this name space belongs to. Normally we will use the
>>>> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
>>>> + * set to a specified bcache cache set.
>>>> + *
>>>> + * - struct bch_owner_list_head
>>>> + *   This is a table for all heads of all owner lists. A owner list records
>>>> + * which page(s) allocated to which owner. After reboot from power failure,
>>>> + * the ownwer may find all its requested and allocated pages from the owner
>>>> + * list by a handler which is converted by a UUID.
>>>> + *
>>>> + * - struct bch_nvm_pages_owner_head
>>>> + *   This is a head of an owner list. Each owner only has one owner list,
>>>> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
>>>> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>>>> + * mandatory, it is a human-readable string for debug purpose. The pointer
>>>> + * recs references to separated nvm page which hold the table of struct
>>>> + * bch_pgalloc_rec.
>>>> + *
>>>> + *- struct bch_nvm_pgalloc_recs
>>>> + *  This structure occupies a whole page, owner_uuid should match the uuid
>>>> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>>>> + * allocated records.
>>>> + *
>>>> + * - struct bch_pgalloc_rec
>>>> + *   Each structure records a range of allocated nvm pages. pgoff is offset
>>>> + * in unit of page size of this allocated nvm page range. The adjoint page
>>>> + * ranges of same owner can be merged into a larger one, therefore pages_nr
>>>> + * is NOT always power of 2.
>>>> + *
>>>> + *
>>>> + * Memory layout on nvdimm namespace 0
>>>> + *
>>>> + *    0 +---------------------------------+
>>>> + *      |                                 |
>>>> + *  4KB +---------------------------------+
>>>> + *      |         bch_nvm_pages_sb        |
>>>> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
>>>> + *      |       bch_owner_list_head       |
>>>> + *      |                                 |
>>>> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
>>>> + *      |       bch_nvm_pgalloc_recs      |
>>>> + *      |  (nvm pages internal usage)     |
>>>> + * 24KB +---------------------------------+
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + * 16MB  +---------------------------------+
>>>> + *      |      allocable nvm pages        |
>>>> + *      |      for buddy allocator        |
>>>> + * end  +---------------------------------+
>>>> + *
>>>> + *
>>>> + *
>>>> + * Memory layout on nvdimm namespace N
>>>> + * (doesn't have owner list)
>>>> + *
>>>> + *    0 +---------------------------------+
>>>> + *      |                                 |
>>>> + *  4KB +---------------------------------+
>>>> + *      |         bch_nvm_pages_sb        |
>>>> + *  8KB +---------------------------------+
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + *      |                                 |
>>>> + * 16MB  +---------------------------------+
>>>> + *      |      allocable nvm pages        |
>>>> + *      |      for buddy allocator        |
>>>> + * end  +---------------------------------+
>>>> + *
>>>> + */
>>>> +
>>>> +#include <linux/types.h>
>>>> +
>>>> +/* In sectors */
>>>> +#define BCH_NVM_PAGES_SB_OFFSET			4096
>>>> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
>>>> +
>>>> +#define BCH_NVM_PAGES_LABEL_SIZE		32
>>>> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
>>>> +
>>>> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
>>>> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
>>>> +
>>>> +#define BCH_NVM_PAGES_SB_VERSION		0
>>>> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
>>>> +
>>>> +static const unsigned char bch_nvm_pages_magic[] = {
>>>> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
>>>> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
>>>> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
>>>> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
>>>> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
>>>> +
>>>> +/* takes 64bit width */
>>>> +struct bch_pgalloc_rec {
>>>> +	__u64	pgoff:52;
>>>> +	__u64	order:6;
>>>> +	__u64	reserved:6;
>>>> +};
>>>> +
>>>> +struct bch_nvm_pgalloc_recs {
>>>> +union {
>>>> +	struct {
>>>> +		struct bch_nvm_pages_owner_head	*owner;
>>>> +		struct bch_nvm_pgalloc_recs	*next;
>> I have concerns about using pointers directly in on-NVDIMM data
>> structure too.  How can you guarantee the NVDIMM devices will be mapped
>> to exact same virtual address across reboot?
>>
>
>
> We use the NVDIMM name space as memory, and from our testing and
> observation, the DAX mapping base address is consistent if the NVDIMM
> address from the e820 table does not change.
>
> And from our testing and observation, the NVDIMM address from the e820
> table does not change when:
> - NVDIMM and DRAM memory population does not change
> - More NVDIMM and/or DRAM is installed on top of the existing memory population
> - NVDIMM is always plugged in the same slot location, with no movement or swap
> - No CPU is removed or changed
>
> For 99.9%+ of the time, when the hardware is working healthily, the above
> conditions can be assumed. Therefore we choose to store the whole linear
> address (pointer) here, rather than a relative offset inside the NVDIMM
> name space.
>
> For the 0.0?% condition where the NVDIMM address from the e820 table
> changes: because the DAX mapping address from the last run is stored in
> ns_start of struct bch_nvm_pages_sb, if the new DAX mapping address is
> different from the ns_start value, all pointers in the owner list can be
> updated by,
>     new_addr = (old_addr - old_ns_start) + new_ns_start
>
> The update can be very fast (and it can be made power-failure tolerant
> with careful coding). Therefore we decided to store the full linear
> address for direct memory access in the 99%+ condition, and to update the
> pointers in the 0.0?% condition when the DAX mapping address of the
> NVDIMM changes.
>
> Handling DAX mapping address changes is not a current high-priority task;
> our next tasks after this series is merged will be power-failure tolerance
> of the owner list (from the Intel developers) and storing bcache btree
> nodes on NVDIMM pages (from me).
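
As a minimal sketch of the rebase rule described above (illustration only,
not part of the posted patch; the helper name nvm_rebase_ptr is
hypothetical), updating one stored pointer would look like:

/* Rebase one stored pointer after the DAX mapping address has changed */
static inline void *nvm_rebase_ptr(void *old_addr,
				   unsigned long old_ns_start,
				   unsigned long new_ns_start)
{
	if (!old_addr)
		return NULL;
	return (void *)((unsigned long)old_addr - old_ns_start + new_ns_start);
}

Every stored pointer (owner_list_head in the super block, each owner's
recs[] pointers and each recs->next) would be walked and rewritten this
way before the structures are used again.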

Thanks for the detailed explanation.  Given "ns_start", this should work
even when the base address changed.

So the question becomes pointer vs. offset: which one is better?  I
guess that you prefer the pointer because it's easier to use.  How about
making the pointer in the NVDIMM like a per-cpu pointer, which is
implemented as an offset with a pointer type?  It's not too hard to use.
With that, you don't need to maintain the code that updates all pointers
when the base address changes.

Best Regards,
Huang, Ying
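
A rough sketch of the per-cpu-style offset pointer suggested above
(illustration only; the type and helper names below are hypothetical and
not from the posted patch): only a byte offset from ns_start is stored on
media, and it is converted to a real pointer at access time.

typedef __u64 nvm_offset_t;	/* byte offset from ns_start, 0 means NULL */

static inline void *nvm_off_to_ptr(void *ns_base, nvm_offset_t off)
{
	return off ? (void *)((char *)ns_base + off) : NULL;
}

static inline nvm_offset_t nvm_ptr_to_off(void *ns_base, void *ptr)
{
	return ptr ? (nvm_offset_t)((char *)ptr - (char *)ns_base) : 0;
}

With this, nothing on media has to be rewritten when the namespace is
mapped at a different address.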

>>>> +		unsigned char			magic[16];
>>>> +		unsigned char			owner_uuid[16];
>>>> +		unsigned int			size;
>>>> +		unsigned int			used;
>>>> +		unsigned long			_pad[4];
>>>> +		struct bch_pgalloc_rec		recs[];
>>>> +	};
>>>> +	unsigned char				pad[8192];
>>>> +};
>>>> +};
>>>> +
>>>> +#define BCH_MAX_RECS					\
>>>> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
>>>> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
>>>> +	 sizeof(struct bch_pgalloc_rec))
>>>> +
>>>> +struct bch_nvm_pages_owner_head {
>>>> +	unsigned char			uuid[16];
>>>> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
>>>> +	/* Per-namespace own lists */
>>>> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
>>>> +};
>>>> +
>>>> +/* heads[0] is always for nvm_pages internal usage */
>>>> +struct bch_owner_list_head {
>>>> +union {
>>>> +	struct {
>>>> +		unsigned int			size;
>>>> +		unsigned int			used;
>>>> +		unsigned long			_pad[4];
>>>> +		struct bch_nvm_pages_owner_head	heads[];
>>>> +	};
>>>> +	unsigned char				pad[8192];
>>>> +};
>>>> +};
>>>> +#define BCH_MAX_OWNER_LIST				\
>>>> +	((sizeof(struct bch_owner_list_head) -		\
>>>> +	 offsetof(struct bch_owner_list_head, heads)) /	\
>>>> +	 sizeof(struct bch_nvm_pages_owner_head))
>>>> +
>>>> +/* The on-media bit order is local CPU order */
>>>> +struct bch_nvm_pages_sb {
>>>> +	unsigned long				csum;
>>>> +	unsigned long				ns_start;
>>>> +	unsigned long				sb_offset;
>>>> +	unsigned long				version;
>>>> +	unsigned char				magic[16];
>>>> +	unsigned char				uuid[16];
>>>> +	unsigned int				page_size;
>>>> +	unsigned int				total_namespaces_nr;
>>>> +	unsigned int				this_namespace_nr;
>>>> +	union {
>>>> +		unsigned char			set_uuid[16];
>>>> +		unsigned long			set_magic;
>>>> +	};
>>>> +
>>>> +	unsigned long				flags;
>>>> +	unsigned long				seq;
>>>> +
>>>> +	unsigned long				feature_compat;
>>>> +	unsigned long				feature_incompat;
>>>> +	unsigned long				feature_ro_compat;
>>>> +
>>>> +	/* For allocable nvm pages from buddy systems */
>>>> +	unsigned long				pages_offset;
>>>> +	unsigned long				pages_total;
>>>> +
>>>> +	unsigned long				pad[8];
>>>> +
>>>> +	/* Only on the first name space */
>>>> +	struct bch_owner_list_head		*owner_list_head;
>>>> +
>>>> +	/* Just for csum_set() */
>>>> +	unsigned int				keys;
>>>> +	unsigned long				d[0];
>>>> +};
>>>> +#endif /* __BITS_PER_LONG == 64 */
>>>> +
>>>> +#endif /* _UAPI_BCACHE_NVM_H */
Christoph Hellwig June 23, 2021, 7:04 a.m. UTC | #6
Storing a pointer on-media is completely broken.  It is not endian
clean, not 32-bit vs 64-bit clean and will lead to problems when addresses
change.  And they will change - maybe not often with DDR-attached
memory, but very certainly with CXL-attached memory that is completely
hot pluggable.
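
For comparison, an endian-clean and pointer-free layout for the allocation
records could look roughly like the sketch below (illustration only, not
what the posted patch does; the *_ondisk and *_offset names are
hypothetical, and cpu_to_le64()/le64_to_cpu() are the kernel-side
byte-order helpers):

struct bch_nvm_pgalloc_recs_ondisk {
	__le64	owner_offset;	/* byte offset of the owner head from ns_start */
	__le64	next_offset;	/* byte offset of the next recs area, 0 == none */
	__u8	magic[16];
	__u8	owner_uuid[16];
	__le32	size;
	__le32	used;
	__le64	_pad[4];
	__le64	recs[];		/* packed pgoff/order, see helpers below */
};

#define BCH_PGALLOC_PGOFF_MASK	((1ULL << 52) - 1)

static inline __le64 bch_pgalloc_rec_pack(__u64 pgoff, unsigned int order)
{
	return cpu_to_le64((pgoff & BCH_PGALLOC_PGOFF_MASK) |
			   ((__u64)order << 52));
}

static inline __u64 bch_pgalloc_rec_pgoff(__le64 rec)
{
	return le64_to_cpu(rec) & BCH_PGALLOC_PGOFF_MASK;
}

static inline unsigned int bch_pgalloc_rec_order(__le64 rec)
{
	return (le64_to_cpu(rec) >> 52) & 0x3f;
}

Fixed-width little-endian fields read the same on any architecture and word
size, and offsets relative to ns_start survive a change of the mapping
address.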
Coly Li June 23, 2021, 7:09 a.m. UTC | #7
On 6/22/21 6:19 PM, Hannes Reinecke wrote:
> On 6/15/21 7:49 AM, Coly Li wrote:
>> This patch initializes the prototype data structures for nvm pages
>> allocator,
>>
>> - struct bch_nvm_pages_sb
>> This is the super block allocated on each nvdimm namespace. A nvdimm
>> set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used
>> to mark which nvdimm set this name space belongs to. Normally we will
>> use the bcache's cache set UUID to initialize this uuid, to connect this
>> nvdimm set to a specified bcache cache set.
>>
>> - struct bch_owner_list_head
>> This is a table for all heads of all owner lists. A owner list records
>> which page(s) allocated to which owner. After reboot from power failure,
>> the ownwer may find all its requested and allocated pages from the owner
> owner

Fixed for next post.


>
>> list by a handler which is converted by a UUID.
>>
>> - struct bch_nvm_pages_owner_head
>> This is a head of an owner list. Each owner only has one owner list,
>> and a nvm page only belongs to an specific owner. uuid[] will be set to
>> owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>> mandatory, it is a human-readable string for debug purpose. The pointer
>> *recs references to separated nvm page which hold the table of struct
>> bch_nvm_pgalloc_rec.
>>
>> - struct bch_nvm_pgalloc_recs
>> This struct occupies a whole page, owner_uuid should match the uuid
>> in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>> allocated records.
>>
>> - struct bch_nvm_pgalloc_rec
>> Each structure records a range of allocated nvm pages.
>>   - Bits  0 - 51: is pages offset of the allocated pages.
>>   - Bits 52 - 57: allocaed size in page_size * order-of-2
>>   - Bits 58 - 63: reserved.
>> Since each of the allocated nvm pages are power of 2, using 6 bits to
>> represent allocated size can have (1<<(1<<64) - 1) * PAGE_SIZE maximum
>> value. It can be a 76 bits width range size in byte for 4KB page size,
>> which is large enough currently.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Jianpeng Ma <jianpeng.ma@intel.com>
>> Cc: Qiaowei Ren <qiaowei.ren@intel.com>
>> ---
>>  include/uapi/linux/bcache-nvm.h | 200 ++++++++++++++++++++++++++++++++
>>  1 file changed, 200 insertions(+)
>>  create mode 100644 include/uapi/linux/bcache-nvm.h
>>
>> diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
>> new file mode 100644
>> index 000000000000..5094a6797679
>> --- /dev/null
>> +++ b/include/uapi/linux/bcache-nvm.h
>> @@ -0,0 +1,200 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +
>> +#ifndef _UAPI_BCACHE_NVM_H
>> +#define _UAPI_BCACHE_NVM_H
>> +
>> +#if (__BITS_PER_LONG == 64)
>> +/*
>> + * Bcache on NVDIMM data structures
>> + */
>> +
>> +/*
>> + * - struct bch_nvm_pages_sb
>> + *   This is the super block allocated on each nvdimm namespace. A nvdimm
>> + * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
>> + * which nvdimm set this name space belongs to. Normally we will use the
>> + * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
>> + * set to a specified bcache cache set.
>> + *
>> + * - struct bch_owner_list_head
>> + *   This is a table for all heads of all owner lists. A owner list records
>> + * which page(s) allocated to which owner. After reboot from power failure,
>> + * the ownwer may find all its requested and allocated pages from the owner
>> + * list by a handler which is converted by a UUID.
>> + *
>> + * - struct bch_nvm_pages_owner_head
>> + *   This is a head of an owner list. Each owner only has one owner list,
>> + * and a nvm page only belongs to an specific owner. uuid[] will be set to
>> + * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
>> + * mandatory, it is a human-readable string for debug purpose. The pointer
>> + * recs references to separated nvm page which hold the table of struct
>> + * bch_pgalloc_rec.
>> + *
>> + *- struct bch_nvm_pgalloc_recs
>> + *  This structure occupies a whole page, owner_uuid should match the uuid
>> + * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
>> + * allocated records.
>> + *
>> + * - struct bch_pgalloc_rec
>> + *   Each structure records a range of allocated nvm pages. pgoff is offset
>> + * in unit of page size of this allocated nvm page range. The adjoint page
>> + * ranges of same owner can be merged into a larger one, therefore pages_nr
>> + * is NOT always power of 2.
>> + *
>> + *
>> + * Memory layout on nvdimm namespace 0
>> + *
>> + *    0 +---------------------------------+
>> + *      |                                 |
>> + *  4KB +---------------------------------+
>> + *      |         bch_nvm_pages_sb        |
>> + *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
>> + *      |       bch_owner_list_head       |
>> + *      |                                 |
>> + * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
>> + *      |       bch_nvm_pgalloc_recs      |
>> + *      |  (nvm pages internal usage)     |
>> + * 24KB +---------------------------------+
>> + *      |                                 |
>> + *      |                                 |
>> + * 16MB  +---------------------------------+
>> + *      |      allocable nvm pages        |
>> + *      |      for buddy allocator        |
>> + * end  +---------------------------------+
>> + *
>> + *
>> + *
>> + * Memory layout on nvdimm namespace N
>> + * (doesn't have owner list)
>> + *
>> + *    0 +---------------------------------+
>> + *      |                                 |
>> + *  4KB +---------------------------------+
>> + *      |         bch_nvm_pages_sb        |
>> + *  8KB +---------------------------------+
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + *      |                                 |
>> + * 16MB  +---------------------------------+
>> + *      |      allocable nvm pages        |
>> + *      |      for buddy allocator        |
>> + * end  +---------------------------------+
>> + *
>> + */
>> +
>> +#include <linux/types.h>
>> +
>> +/* In sectors */
>> +#define BCH_NVM_PAGES_SB_OFFSET			4096
>> +#define BCH_NVM_PAGES_OFFSET			(16 << 20)
>> +
>> +#define BCH_NVM_PAGES_LABEL_SIZE		32
>> +#define BCH_NVM_PAGES_NAMESPACES_MAX		8
>> +
>> +#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
>> +#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
>> +
>> +#define BCH_NVM_PAGES_SB_VERSION		0
>> +#define BCH_NVM_PAGES_SB_VERSION_MAX		0
>> +
>> +static const unsigned char bch_nvm_pages_magic[] = {
>> +	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
>> +	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
>> +static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
>> +	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
>> +	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
>> +
>> +/* takes 64bit width */
>> +struct bch_pgalloc_rec {
>> +	__u64	pgoff:52;
>> +	__u64	order:6;
>> +	__u64	reserved:6;
>> +};
>> +
>> +struct bch_nvm_pgalloc_recs {
>> +union {
> Indentation.

Copied. It will be updated in the next post.

>
>> +	struct {
>> +		struct bch_nvm_pages_owner_head	*owner;
>> +		struct bch_nvm_pgalloc_recs	*next;
>> +		unsigned char			magic[16];
>> +		unsigned char			owner_uuid[16];
>> +		unsigned int			size;
>> +		unsigned int			used;
>> +		unsigned long			_pad[4];
>> +		struct bch_pgalloc_rec		recs[];
>> +	};
>> +	unsigned char				pad[8192];
>> +};
>> +};
>> +
> Consider using __u64 and friends when specifying a structure with a
> fixed alignment; that also removes the need of the BITS_PER_LONG ifdef
> at the top.

It _WAS_ how we did it in the first version. But Jens didn't agree with this:

"This doesn't look right in a user header, any user API should be 32-bit
and 64-bit agnostic."

My follow-up explanation was not convincing, so I changed all the
__u32/__u64 stuff into unsigned int and unsigned long.

Considering the nvm-pages allocator only works when both the register word
and the physical address are 64 bits wide, unsigned long will exactly
match 64 bits and unsigned int will exactly match 32 bits, so I am fine
with either form. Jens is the upper-layer maintainer, so I chose to
listen to him.


>> +#define BCH_MAX_RECS					\
>> +	((sizeof(struct bch_nvm_pgalloc_recs) -		\
>> +	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
>> +	 sizeof(struct bch_pgalloc_rec))
>> +
> What _are_ you doing here?

BCH_MAX_RECS is a constant value indicating how many elements can be
stored in the recs[] array of struct bch_nvm_pgalloc_recs.

> You're not seriously using the 'pad' field as a placeholder to size the
> structure accordingly?

The code works the way you expected. The 8KB pad forces struct
bch_nvm_pgalloc_recs to be two 4K pages, which is a power-of-2 order size
for the nvm-pages buddy allocator.


> Also, what is the size of the 'bch_nvm_pgalloc_recs' structure?
> 8k + header size?

struct bch_nvm_pgalloc_recs is exactly 8K. The header is inside the 8K
space.


> That is very awkward, as the page allocator won't be able to handle it
> efficiently.
> Please size it to either 8k or 16k overall.
> And if you do that you can simplify this define.

The in-memory layout of struct bch_nvm_pgalloc_recs is

|<------------ 8K ----------->|
[header] [recs ...............]
         |<-- BCH_MAX_RECS -->|


So the code works the same as you expect.

>
>> +struct bch_nvm_pages_owner_head {
>> +	unsigned char			uuid[16];
>> +	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
>> +	/* Per-namespace own lists */
>> +	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
>> +};
>> +
>> +/* heads[0] is always for nvm_pages internal usage */
>> +struct bch_owner_list_head {
>> +union {
>> +	struct {
>> +		unsigned int			size;
>> +		unsigned int			used;
>> +		unsigned long			_pad[4];
>> +		struct bch_nvm_pages_owner_head	heads[];
>> +	};
>> +	unsigned char				pad[8192];
>> +};
>> +};
>> +#define BCH_MAX_OWNER_LIST				\
>> +	((sizeof(struct bch_owner_list_head) -		\
>> +	 offsetof(struct bch_owner_list_head, heads)) /	\
>> +	 sizeof(struct bch_nvm_pages_owner_head))
>> +
> Same here.
> Please size it that the 'bch_owner_list_head' structure fits into either
> 8k or 16k.

It works as you expected. But I realize the indentation may be misleading;
I should add a space before offsetof(), like this,

163 #define BCH_MAX_OWNER_LIST                              \
164         ((sizeof(struct bch_owner_list_head) -          \
165           offsetof(struct bch_owner_list_head, heads)) /\
166          sizeof(struct bch_nvm_pages_owner_head))

It means (8K - header_size) / sizeof(struct bch_nvm_pages_owner_head).
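
For reference, the sizing argument for the two 8K structures can be checked
mechanically; a sketch of such a check (not in the posted patch, using the
kernel-side BUILD_BUG_ON() helper) is:

static inline void bch_nvm_check_layout(void)
{
	/* Both on-media areas are exactly two 4K pages */
	BUILD_BUG_ON(sizeof(struct bch_nvm_pgalloc_recs) != 8192);
	BUILD_BUG_ON(sizeof(struct bch_owner_list_head) != 8192);
}

On a 64-bit build the pgalloc header (owner, next, magic[16], owner_uuid[16],
size, used, _pad[4]) occupies 88 bytes, so BCH_MAX_RECS works out to
(8192 - 88) / 8 = 1013 records per 8K area; BCH_MAX_OWNER_LIST is derived
the same way from the owner list head header.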

>
>> +/* The on-media bit order is local CPU order */
>> +struct bch_nvm_pages_sb {
>> +	unsigned long				csum;
>> +	unsigned long				ns_start;
>> +	unsigned long				sb_offset;
>> +	unsigned long				version;
>> +	unsigned char				magic[16];
>> +	unsigned char				uuid[16];
>> +	unsigned int				page_size;
>> +	unsigned int				total_namespaces_nr;
>> +	unsigned int				this_namespace_nr;
>> +	union {
>> +		unsigned char			set_uuid[16];
>> +		unsigned long			set_magic;
>> +	};
>> +
>> +	unsigned long				flags;
>> +	unsigned long				seq;
>> +
>> +	unsigned long				feature_compat;
>> +	unsigned long				feature_incompat;
>> +	unsigned long				feature_ro_compat;
>> +
>> +	/* For allocable nvm pages from buddy systems */
>> +	unsigned long				pages_offset;
>> +	unsigned long				pages_total;
>> +
>> +	unsigned long				pad[8];
>> +
>> +	/* Only on the first name space */
>> +	struct bch_owner_list_head		*owner_list_head;
>> +
>> +	/* Just for csum_set() */
>> +	unsigned int				keys;
>> +	unsigned long				d[0];
>> +};
>>

Thanks for your review. I will update all the addressed locations except
for the __u32/__u64 stuff, because Jens didn't want them.

Coly Li
Coly Li June 23, 2021, 7:19 a.m. UTC | #8
On 6/23/21 3:04 PM, Christoph Hellwig wrote:
> Storing a pointer on-media is completely broken.  It is not endian
> clean, not 32-bit vs 64-bit clean and will lead to problems when addresses

Why is it not endian clean, and not 32-bit vs. 64-bit clean, for bcache?
Bcache is indeed not endian clean, and libnvdimm only works with
a 64-bit physical address width. The only restriction from using pointers
here is that the CPU register word should be 64 bits, because we use the
NVDIMM as memory.

Isn't this one of the ways NVDIMM (especially Intel AEP) is designed to be
used: as non-volatile memory?

> change.  And they will change - maybe not often with DDR-attached
> memory, but very certainly with CXL-attached memory that is completely
> hot pluggable.

Does the already-mapped DAX base address change at runtime during memory
hot plugging?
If not, it won't be a problem here for this specific use case.

Coly Li
Christoph Hellwig June 23, 2021, 7:21 a.m. UTC | #9
On Wed, Jun 23, 2021 at 03:19:11PM +0800, Coly Li wrote:
> Bcache is indeed not endian clean,

Then we need to fix that eventually rather than making it worse.  Which
means any _new_ data structure should start that way.

> and libnvdimm only works with
> a 64-bit physical address width.

Maybe it does right now.  But there is nothing fundamental in that, so
please don't design stupid on-disk formats that encode assumptions which
are going to come back to bite us sooner or later.  Be that by adding
32-bit support for any Linux DAX device, or by new 96- or 128-bit CPUs.

> The only restriction from using pointers here is that
> the CPU register word should be 64 bits, because we use the NVDIMM as memory.
> 
> Isn't this one of the ways NVDIMM (especially Intel AEP) is designed to be
> used: as non-volatile memory?

Not for on-disk data structures.

> Does the already-mapped DAX base address change at runtime during memory
> hot plugging?
> If not, it won't be a problem here for this specific use case.

It could change between one use and another.
Coly Li June 23, 2021, 10:05 a.m. UTC | #10
On 6/23/21 3:21 PM, Christoph Hellwig wrote:
> On Wed, Jun 23, 2021 at 03:19:11PM +0800, Coly Li wrote:
>> Bcache is indeed not endian clean,
> Then we need to fix that eventually rather than making it worse.  Which
> means any _new_ data structure should start that way.

The cache device (typically an SSD) of bcache is designed to be dedicated
to a single local machine. Any storage migration between machines with
different endianness should first flush the dirty data to the backing
hard drive. For bcache metadata on the cache device (typically an SSD),
it is designed to NOT move dirty cache between machines with different
endianness; in practice there is no such use case, and it is not supported
by any Linux distribution.

Not supporting different endianness is by design, so why should we fix it
when there is no real use case?

BTW, the discussion is only about the cache device, because the bcache
metadata is stored on it. For the backing hard drive, the endianness is
transparent to bcache and decided by upper-layer code like the file system
or user-space applications; it is fully endian clean.


>> and libnvdimm only works with
>> a 64-bit physical address width.
> Maybe it does right now.  But there is nothing fundamental in that, so
> please don't design stupid on-disk formats that encode assumptions which
> are going to come back to bite us sooner or later.  Be that by adding
> 32-bit support for any Linux DAX device, or by new 96- or 128-bit CPUs.

This is an unfair restriction :-)
The nvdimm support for bcache heavily depends on libnvdimm; that is, we
should follow up on all conditions that libnvdimm supports. But requiring
us to support conditions that even libnvdimm does not support yet is too
early at this stage.

And, if libnvdimm (not DAX) supports 32-bit or new 96- or 128-bit CPUs,
considering the data structures are arrays and singly linked lists, it
won't be too complicated to follow up.

>> The only restriction from using pointers here is that
>> the CPU register word should be 64 bits, because we use the NVDIMM as memory.
>>
>> Isn't this one of the ways NVDIMM (especially Intel AEP) is designed to be
>> used: as non-volatile memory?
> Not for on-disk data structures.

This is not an on-disk data structure. We use the NVDIMM as memory, and
access the internal data structures just as existing code does with DRAM.

Please encourage us to give this might-be-different idea a serious try.

>> Does the already-mapped DAX base address change at runtime during memory
>> hot plugging?
>> If not, it won't be a problem here for this specific use case.
> It could change between one use and another.

Hmm, I don't understand the implicit meaning of the above line.
Could you please offer a detailed example?


Thank you for looking at this and providing valuable comments. All of the
above responses are not argument or stubbornness; I do want to reach a
clear understanding through the discussion with you, so that we won't
regret the current design in the future.

Coly Li
Coly Li June 23, 2021, 11:16 a.m. UTC | #11
On 6/23/21 6:05 PM, Coly Li wrote:
> On 6/23/21 3:21 PM, Christoph Hellwig wrote:
> Does the already-mapped DAX base address change at runtime during memory
> hot plugging?
> If not, it won't be a problem here for this specific use case.
>> It could change between one use and another.
> Hmm, I don't understand the implicit meaning of the above line.
> Could you please offer a detailed example?
>

Hi Christoph,

I feel I now understand "It could change between one use and another."
Yes, this is a problem for the full-pointer design. I will switch to a
base + offset format.

Thank you for joining the discussion and providing your comments, which
helped me improve my understanding for a better design :-)

Coly Li
Christoph Hellwig June 23, 2021, 11:49 a.m. UTC | #12
On Wed, Jun 23, 2021 at 06:05:51PM +0800, Coly Li wrote:
> The cache device (typically an SSD) of bcache is designed to be dedicated
> to a single local machine. Any storage migration between machines with
> different endianness should first flush the dirty data to the backing
> hard drive.

Now my G5 died and I need to recover the data using my x86 laptop,
what am I going to do?

> >> If not, it won't be a problem here for this specific use case.
> > It could change between one use and another.
> 
> Hmm, I don't understand the implicit meaning of the above line.
> Could you please offer a detailed example?

There is no guarantee your nvdimm or CXL memory device will show up
at the same address.
Coly Li June 23, 2021, 12:09 p.m. UTC | #13
On 6/23/21 7:49 PM, Christoph Hellwig wrote:
> On Wed, Jun 23, 2021 at 06:05:51PM +0800, Coly Li wrote:
>> The cache device (typically an SSD) of bcache is designed to be dedicated
>> to a single local machine. Any storage migration between machines with
>> different endianness should first flush the dirty data to the backing
>> hard drive.
> Now my G5 died and I need to recover the data using my x86 laptop,
> what am I going to do?
>
>>>> If not, it won't be a problem here for this specific use case.
>>> It could change between one use and another.
>> Hmm, I don't understand the implicit meaning of the above line.
>> Could you please offer a detailed example?
> There is no guarantee your nvdimm or CXL memory device will show up
> at the same address.

Copied, I fully understand. Now I am working on converting the full
pointers to the [base + offset] form.

Thanks for your patient explanation :-)

Coly Li
diff mbox series

Patch

diff --git a/include/uapi/linux/bcache-nvm.h b/include/uapi/linux/bcache-nvm.h
new file mode 100644
index 000000000000..5094a6797679
--- /dev/null
+++ b/include/uapi/linux/bcache-nvm.h
@@ -0,0 +1,200 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+#ifndef _UAPI_BCACHE_NVM_H
+#define _UAPI_BCACHE_NVM_H
+
+#if (__BITS_PER_LONG == 64)
+/*
+ * Bcache on NVDIMM data structures
+ */
+
+/*
+ * - struct bch_nvm_pages_sb
+ *   This is the super block allocated on each nvdimm namespace. A nvdimm
+ * set may have multiple namespaces, bch_nvm_pages_sb->set_uuid is used to mark
+ * which nvdimm set this name space belongs to. Normally we will use the
+ * bcache's cache set UUID to initialize this uuid, to connect this nvdimm
+ * set to a specified bcache cache set.
+ *
+ * - struct bch_owner_list_head
+ *   This is a table for all heads of all owner lists. A owner list records
+ * which page(s) allocated to which owner. After reboot from power failure,
+ * the ownwer may find all its requested and allocated pages from the owner
+ * list by a handler which is converted by a UUID.
+ *
+ * - struct bch_nvm_pages_owner_head
+ *   This is a head of an owner list. Each owner only has one owner list,
+ * and a nvm page only belongs to an specific owner. uuid[] will be set to
+ * owner's uuid, for bcache it is the bcache's cache set uuid. label is not
+ * mandatory, it is a human-readable string for debug purpose. The pointer
+ * recs references to separated nvm page which hold the table of struct
+ * bch_pgalloc_rec.
+ *
+ *- struct bch_nvm_pgalloc_recs
+ *  This structure occupies a whole page, owner_uuid should match the uuid
+ * in struct bch_nvm_pages_owner_head. recs[] is the real table contains all
+ * allocated records.
+ *
+ * - struct bch_pgalloc_rec
+ *   Each structure records a range of allocated nvm pages. pgoff is offset
+ * in unit of page size of this allocated nvm page range. The adjoint page
+ * ranges of same owner can be merged into a larger one, therefore pages_nr
+ * is NOT always power of 2.
+ *
+ *
+ * Memory layout on nvdimm namespace 0
+ *
+ *    0 +---------------------------------+
+ *      |                                 |
+ *  4KB +---------------------------------+
+ *      |         bch_nvm_pages_sb        |
+ *  8KB +---------------------------------+ <--- bch_nvm_pages_sb.bch_owner_list_head
+ *      |       bch_owner_list_head       |
+ *      |                                 |
+ * 16KB +---------------------------------+ <--- bch_owner_list_head.heads[0].recs[0]
+ *      |       bch_nvm_pgalloc_recs      |
+ *      |  (nvm pages internal usage)     |
+ * 24KB +---------------------------------+
+ *      |                                 |
+ *      |                                 |
+ * 16MB  +---------------------------------+
+ *      |      allocable nvm pages        |
+ *      |      for buddy allocator        |
+ * end  +---------------------------------+
+ *
+ *
+ *
+ * Memory layout on nvdimm namespace N
+ * (doesn't have owner list)
+ *
+ *    0 +---------------------------------+
+ *      |                                 |
+ *  4KB +---------------------------------+
+ *      |         bch_nvm_pages_sb        |
+ *  8KB +---------------------------------+
+ *      |                                 |
+ *      |                                 |
+ *      |                                 |
+ *      |                                 |
+ *      |                                 |
+ *      |                                 |
+ * 16MB  +---------------------------------+
+ *      |      allocable nvm pages        |
+ *      |      for buddy allocator        |
+ * end  +---------------------------------+
+ *
+ */
+
+#include <linux/types.h>
+
+/* In sectors */
+#define BCH_NVM_PAGES_SB_OFFSET			4096
+#define BCH_NVM_PAGES_OFFSET			(16 << 20)
+
+#define BCH_NVM_PAGES_LABEL_SIZE		32
+#define BCH_NVM_PAGES_NAMESPACES_MAX		8
+
+#define BCH_NVM_PAGES_OWNER_LIST_HEAD_OFFSET	(8<<10)
+#define BCH_NVM_PAGES_SYS_RECS_HEAD_OFFSET	(16<<10)
+
+#define BCH_NVM_PAGES_SB_VERSION		0
+#define BCH_NVM_PAGES_SB_VERSION_MAX		0
+
+static const unsigned char bch_nvm_pages_magic[] = {
+	0x17, 0xbd, 0x53, 0x7f, 0x1b, 0x23, 0xd6, 0x83,
+	0x46, 0xa4, 0xf8, 0x28, 0x17, 0xda, 0xec, 0xa9 };
+static const unsigned char bch_nvm_pages_pgalloc_magic[] = {
+	0x39, 0x25, 0x3f, 0xf7, 0x27, 0x17, 0xd0, 0xb9,
+	0x10, 0xe6, 0xd2, 0xda, 0x38, 0x68, 0x26, 0xae };
+
+/* takes 64bit width */
+struct bch_pgalloc_rec {
+	__u64	pgoff:52;
+	__u64	order:6;
+	__u64	reserved:6;
+};
+
+struct bch_nvm_pgalloc_recs {
+union {
+	struct {
+		struct bch_nvm_pages_owner_head	*owner;
+		struct bch_nvm_pgalloc_recs	*next;
+		unsigned char			magic[16];
+		unsigned char			owner_uuid[16];
+		unsigned int			size;
+		unsigned int			used;
+		unsigned long			_pad[4];
+		struct bch_pgalloc_rec		recs[];
+	};
+	unsigned char				pad[8192];
+};
+};
+
+#define BCH_MAX_RECS					\
+	((sizeof(struct bch_nvm_pgalloc_recs) -		\
+	 offsetof(struct bch_nvm_pgalloc_recs, recs)) /	\
+	 sizeof(struct bch_pgalloc_rec))
+
+struct bch_nvm_pages_owner_head {
+	unsigned char			uuid[16];
+	unsigned char			label[BCH_NVM_PAGES_LABEL_SIZE];
+	/* Per-namespace own lists */
+	struct bch_nvm_pgalloc_recs	*recs[BCH_NVM_PAGES_NAMESPACES_MAX];
+};
+
+/* heads[0] is always for nvm_pages internal usage */
+struct bch_owner_list_head {
+union {
+	struct {
+		unsigned int			size;
+		unsigned int			used;
+		unsigned long			_pad[4];
+		struct bch_nvm_pages_owner_head	heads[];
+	};
+	unsigned char				pad[8192];
+};
+};
+#define BCH_MAX_OWNER_LIST				\
+	((sizeof(struct bch_owner_list_head) -		\
+	 offsetof(struct bch_owner_list_head, heads)) /	\
+	 sizeof(struct bch_nvm_pages_owner_head))
+
+/* The on-media bit order is local CPU order */
+struct bch_nvm_pages_sb {
+	unsigned long				csum;
+	unsigned long				ns_start;
+	unsigned long				sb_offset;
+	unsigned long				version;
+	unsigned char				magic[16];
+	unsigned char				uuid[16];
+	unsigned int				page_size;
+	unsigned int				total_namespaces_nr;
+	unsigned int				this_namespace_nr;
+	union {
+		unsigned char			set_uuid[16];
+		unsigned long			set_magic;
+	};
+
+	unsigned long				flags;
+	unsigned long				seq;
+
+	unsigned long				feature_compat;
+	unsigned long				feature_incompat;
+	unsigned long				feature_ro_compat;
+
+	/* For allocable nvm pages from buddy systems */
+	unsigned long				pages_offset;
+	unsigned long				pages_total;
+
+	unsigned long				pad[8];
+
+	/* Only on the first name space */
+	struct bch_owner_list_head		*owner_list_head;
+
+	/* Just for csum_set() */
+	unsigned int				keys;
+	unsigned long				d[0];
+};
+#endif /* __BITS_PER_LONG == 64 */
+
+#endif /* _UAPI_BCACHE_NVM_H */