libnvdimm: Clarify nd_pfn_init() flow
diff mbox series

Message ID 154785884329.2202034.3295476681016958230.stgit@dwillia2-desk3.amr.corp.intel.com
State New
Headers show
Series
  • libnvdimm: Clarify nd_pfn_init() flow
Related show

Commit Message

Dan Williams Jan. 19, 2019, 12:47 a.m. UTC
In recent days, 2 engineers, including the original author of
nd_pfn_init(), overlooked the internal call to nd_pfn_validate() and the
implications to memory allocation.

Clarify this situation to help anyone that reads through this code in
the future.

Reported-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/btt_devs.c |    5 +++++
 drivers/nvdimm/dax_devs.c |    5 +++++
 drivers/nvdimm/pfn_devs.c |   21 +++++++++++++++++++++
 3 files changed, 31 insertions(+)

Comments

Wei Yang Jan. 21, 2019, 7:51 a.m. UTC | #1
On Fri, Jan 18, 2019 at 04:47:23PM -0800, Dan Williams wrote:
>In recent days, 2 engineers, including the original author of
>nd_pfn_init(), overlooked the internal call to nd_pfn_validate() and the
>implications to memory allocation.
>
>Clarify this situation to help anyone that reads through this code in
>the future.
>
>Reported-by: Wei Yang <richardw.yang@linux.intel.com>
>Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>---
> drivers/nvdimm/btt_devs.c |    5 +++++
> drivers/nvdimm/dax_devs.c |    5 +++++
> drivers/nvdimm/pfn_devs.c |   21 +++++++++++++++++++++
> 3 files changed, 31 insertions(+)
>
>diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
>index 795ad4ff35ca..e0a6f2491e57 100644
>--- a/drivers/nvdimm/btt_devs.c
>+++ b/drivers/nvdimm/btt_devs.c
>@@ -354,6 +354,11 @@ int nd_btt_probe(struct device *dev, struct nd_namespace_common *ndns)
> 		put_device(btt_dev);
> 	}
> 
>+	/*
>+	 * Successful probe indicates to the caller that an nd_btt
>+	 * personality device has been registered and the caller can
>+	 * fail the probe of the baseline namespace device.
>+	 */
> 	return rc;
> }
> EXPORT_SYMBOL(nd_btt_probe);
>diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
>index 0453f49dc708..65010d955fa7 100644
>--- a/drivers/nvdimm/dax_devs.c
>+++ b/drivers/nvdimm/dax_devs.c
>@@ -136,6 +136,11 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
> 	} else
> 		__nd_device_register(dax_dev);
> 
>+	/*
>+	 * Successful probe indicates to the caller that a device-dax
>+	 * personality device has been registered and the caller can
>+	 * fail the probe of the baseline namespace device.
>+	 */
> 	return rc;
> }
> EXPORT_SYMBOL(nd_dax_probe);
>diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
>index 6f22272e8d80..a8783b5a76ba 100644
>--- a/drivers/nvdimm/pfn_devs.c
>+++ b/drivers/nvdimm/pfn_devs.c
>@@ -576,6 +576,11 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
> 	} else
> 		__nd_device_register(pfn_dev);
> 
>+	/*
>+	 * Successful probe indicates to the caller that an nd_pfn
>+	 * personality device has been registered and the caller can
>+	 * fail the probe of the baseline namespace device.
>+	 */
> 	return rc;
> }
> EXPORT_SYMBOL(nd_pfn_probe);
>@@ -706,6 +711,22 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
> 		sig = DAX_SIG;
> 	else
> 		sig = PFN_SIG;
>+
>+	/*
>+	 * Check for an existing 'pfn' superblock before writing a new
>+	 * one. The intended flow is that on the first probe of an
>+	 * nd_{pfn,dax} device the superblock is calculated and written
>+	 * to the namespace. In this case nd_pfn_validate() returns
>+	 * -ENODEV because no valid superblock exists currently.
>+	 *
>+	 * On subsequent probes nd_pfn_validate() will find a valid
>+	 * superblock and return 0.
>+	 *
>+	 * If an assumption of the superblock has been violated, like a
>+	 * change to the physical alignment of the namespace,
>+	 * nd_pfn_validate() will return an error other than -ENODEV to
>+	 * fail probing.
>+	 */

Let me reply in this thread. Sorry for my poor understand, I don't get it
clearly now.

To be honest, the structure is a little bit complicated, if my understanding
is not correct, please forgive my poor understand.

Below is a code flow. To simply analysis, I setup kernel parameter memmap to
emulate, and configure one namespace to mode devdax. So that we would have the
same root for code flow.

Let's start with nd_region_driver:

    nd_region_probe
        nd_region_register_namespaces
	    create_namespaces
	nd_region->btt_seed = nd_btt_create(nd_region);
	nd_region->pfn_seed = nd_pfn_create(nd_region);
	nd_region->dax_seed = nd_dax_create(nd_region);

After this, there are 4 devices created:

	namespace0.0, btt0.0, pfn0.0, dax0.0

And there are two drivers related to these devices. The relationship between
devices and drivers are:

	nd_pmem_driver: namespace0.0, btt0.0, pfn0.0
	dax_pmem_driver: dax0.0

Only the probe function on namespace0.0 succeed. Even others get -ENODEV,
those devices themself is not released.

Then let's look at the probe on namespace0.0:

    nd_pmem_probe
        nd_btt_probe
	nd_pfn_probe
	nd_dax_probe

When namespace is configured as devdax, only nd_dax_probe do some real work.

Then I see some different behavior as your description.

    * nd_dax_probe->nd_pfn_validate() return 0 instead of -ENODEV.
    * so device dax0.1 is created
    * dax_pmem_probe is called on dax0.1 and nd_pfn_validate() return 0 too

This means pfn_sb is created twice in following functions:

    * nd_dax_probe
    * dax_pmem_probe

Also, I have one confusion about your saying: two probes.

If the two probes are:

    * for dax%d.%d: 1. nd_dax_probe 2. dax_pmem_probe   
    * for pfn%d.%d: 1. nd_pfn_probe 2. nd_pmem_probe    

Then, if the first probe fails, the device itself would be destroyed. How the
second probe do its job?

> 	rc = nd_pfn_validate(nd_pfn, sig);
> 	if (rc != -ENODEV)
> 		return rc;
Dan Williams Jan. 21, 2019, 6:04 p.m. UTC | #2
On Sun, Jan 20, 2019 at 11:51 PM Wei Yang <richardw.yang@linux.intel.com> wrote:
> On Fri, Jan 18, 2019 at 04:47:23PM -0800, Dan Williams wrote:
[..]
> Also, I have one confusion about your saying: two probes.
>
> If the two probes are:
>
>     * for dax%d.%d: 1. nd_dax_probe 2. dax_pmem_probe
>     * for pfn%d.%d: 1. nd_pfn_probe 2. nd_pmem_probe
>
> Then, if the first probe fails, the device itself would be destroyed. How the
> second probe do its job?
>
> >       rc = nd_pfn_validate(nd_pfn, sig);
> >       if (rc != -ENODEV)
> >               return rc;

Here is an example path for a device-dax instance:

    /sys/devices/platform/e820_pmem/ndbus0/region0/dax0.1/dax0.0

In this case the order of events is:

1/ region0 discovers it contains a pmem namespace and registers namespace0.0
2/ The pmem namespace driver calls nd_dax_probe() to check for the
presence of a device-dax configuration
3/ If present, nd_pfn_validate() returns 0 and nd_dax_probe()
registers the dax0.1 device (this is a libnvdimm 'personality device).
4/ When nd_pmem_probe() sees nd_dax_probe() return 0 it in turn fails
the probe of namespace0.0 with -ENXIO. All devm allocations during the
probe of namespace0.0 are released.
5/ Meanwhile the created dax0.1 device is recognized by the dax_pmem device.
6/ dax_pmem_probe() uses nd_pfn_init() to validate the configuration
and ultimately creates the dax0.0 character device

The pmem namepsace driver is effectively a router to other personality
drivers that may be layered on top of the base namespace capacity.
Wei Yang Jan. 21, 2019, 8:57 p.m. UTC | #3
On Mon, Jan 21, 2019 at 10:04:40AM -0800, Dan Williams wrote:
>On Sun, Jan 20, 2019 at 11:51 PM Wei Yang <richardw.yang@linux.intel.com> wrote:
>> On Fri, Jan 18, 2019 at 04:47:23PM -0800, Dan Williams wrote:
>[..]
>> Also, I have one confusion about your saying: two probes.
>>
>> If the two probes are:
>>
>>     * for dax%d.%d: 1. nd_dax_probe 2. dax_pmem_probe
>>     * for pfn%d.%d: 1. nd_pfn_probe 2. nd_pmem_probe
>>
>> Then, if the first probe fails, the device itself would be destroyed. How the
>> second probe do its job?
>>
>> >       rc = nd_pfn_validate(nd_pfn, sig);
>> >       if (rc != -ENODEV)
>> >               return rc;
>
>Here is an example path for a device-dax instance:
>
>    /sys/devices/platform/e820_pmem/ndbus0/region0/dax0.1/dax0.0
>
>In this case the order of events is:
>
>1/ region0 discovers it contains a pmem namespace and registers namespace0.0
>2/ The pmem namespace driver calls nd_dax_probe() to check for the
>presence of a device-dax configuration
>3/ If present, nd_pfn_validate() returns 0 and nd_dax_probe()
>registers the dax0.1 device (this is a libnvdimm 'personality device).
>4/ When nd_pmem_probe() sees nd_dax_probe() return 0 it in turn fails
>the probe of namespace0.0 with -ENXIO. All devm allocations during the
>probe of namespace0.0 are released.

I may have another opinion here.

The probe return error means the device will not attach to this driver.
But the device itself it not released.

We allocate devm on one device and those memory will be released when
the device is destroyed. If I am correct.

This means at this point, pfn_sb's memory still exists in the system.
Even finally we will release it, when namespace0.0 is destroyed.

>5/ Meanwhile the created dax0.1 device is recognized by the dax_pmem device.
>6/ dax_pmem_probe() uses nd_pfn_init() to validate the configuration
>and ultimately creates the dax0.0 character device
>
>The pmem namepsace driver is effectively a router to other personality
>drivers that may be layered on top of the base namespace capacity.
Dan Williams Jan. 21, 2019, 10:34 p.m. UTC | #4
On Mon, Jan 21, 2019 at 12:57 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, Jan 21, 2019 at 10:04:40AM -0800, Dan Williams wrote:
> >On Sun, Jan 20, 2019 at 11:51 PM Wei Yang <richardw.yang@linux.intel.com> wrote:
> >> On Fri, Jan 18, 2019 at 04:47:23PM -0800, Dan Williams wrote:
> >[..]
> >> Also, I have one confusion about your saying: two probes.
> >>
> >> If the two probes are:
> >>
> >>     * for dax%d.%d: 1. nd_dax_probe 2. dax_pmem_probe
> >>     * for pfn%d.%d: 1. nd_pfn_probe 2. nd_pmem_probe
> >>
> >> Then, if the first probe fails, the device itself would be destroyed. How the
> >> second probe do its job?
> >>
> >> >       rc = nd_pfn_validate(nd_pfn, sig);
> >> >       if (rc != -ENODEV)
> >> >               return rc;
> >
> >Here is an example path for a device-dax instance:
> >
> >    /sys/devices/platform/e820_pmem/ndbus0/region0/dax0.1/dax0.0
> >
> >In this case the order of events is:
> >
> >1/ region0 discovers it contains a pmem namespace and registers namespace0.0
> >2/ The pmem namespace driver calls nd_dax_probe() to check for the
> >presence of a device-dax configuration
> >3/ If present, nd_pfn_validate() returns 0 and nd_dax_probe()
> >registers the dax0.1 device (this is a libnvdimm 'personality device).
> >4/ When nd_pmem_probe() sees nd_dax_probe() return 0 it in turn fails
> >the probe of namespace0.0 with -ENXIO. All devm allocations during the
> >probe of namespace0.0 are released.
>
> I may have another opinion here.
>
> The probe return error means the device will not attach to this driver.
> But the device itself it not released.
>
> We allocate devm on one device and those memory will be released when
> the device is destroyed. If I am correct.
>
> This means at this point, pfn_sb's memory still exists in the system.
> Even finally we will release it, when namespace0.0 is destroyed.

No, that's not the way devm works. Memory allocated by devm is
released at ->probe() failure, or after ->remove()

See the devres_release_all() call in drivers/base/dd.c::really_probe()
Wei Yang Jan. 22, 2019, 12:23 a.m. UTC | #5
On Mon, Jan 21, 2019 at 02:34:06PM -0800, Dan Williams wrote:
>On Mon, Jan 21, 2019 at 12:57 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>>
>> On Mon, Jan 21, 2019 at 10:04:40AM -0800, Dan Williams wrote:
>> >On Sun, Jan 20, 2019 at 11:51 PM Wei Yang <richardw.yang@linux.intel.com> wrote:
>> >> On Fri, Jan 18, 2019 at 04:47:23PM -0800, Dan Williams wrote:
>> >[..]
>> >> Also, I have one confusion about your saying: two probes.
>> >>
>> >> If the two probes are:
>> >>
>> >>     * for dax%d.%d: 1. nd_dax_probe 2. dax_pmem_probe
>> >>     * for pfn%d.%d: 1. nd_pfn_probe 2. nd_pmem_probe
>> >>
>> >> Then, if the first probe fails, the device itself would be destroyed. How the
>> >> second probe do its job?
>> >>
>> >> >       rc = nd_pfn_validate(nd_pfn, sig);
>> >> >       if (rc != -ENODEV)
>> >> >               return rc;
>> >
>> >Here is an example path for a device-dax instance:
>> >
>> >    /sys/devices/platform/e820_pmem/ndbus0/region0/dax0.1/dax0.0
>> >
>> >In this case the order of events is:
>> >
>> >1/ region0 discovers it contains a pmem namespace and registers namespace0.0
>> >2/ The pmem namespace driver calls nd_dax_probe() to check for the
>> >presence of a device-dax configuration
>> >3/ If present, nd_pfn_validate() returns 0 and nd_dax_probe()
>> >registers the dax0.1 device (this is a libnvdimm 'personality device).
>> >4/ When nd_pmem_probe() sees nd_dax_probe() return 0 it in turn fails
>> >the probe of namespace0.0 with -ENXIO. All devm allocations during the
>> >probe of namespace0.0 are released.
>>
>> I may have another opinion here.
>>
>> The probe return error means the device will not attach to this driver.
>> But the device itself it not released.
>>
>> We allocate devm on one device and those memory will be released when
>> the device is destroyed. If I am correct.
>>
>> This means at this point, pfn_sb's memory still exists in the system.
>> Even finally we will release it, when namespace0.0 is destroyed.
>
>No, that's not the way devm works. Memory allocated by devm is
>released at ->probe() failure, or after ->remove()
>
>See the devres_release_all() call in drivers/base/dd.c::really_probe()

Ah, you are right.
Wei Yang Jan. 22, 2019, 12:26 a.m. UTC | #6
On Fri, Jan 18, 2019 at 04:47:23PM -0800, Dan Williams wrote:
>In recent days, 2 engineers, including the original author of
>nd_pfn_init(), overlooked the internal call to nd_pfn_validate() and the
>implications to memory allocation.
>
>Clarify this situation to help anyone that reads through this code in
>the future.
>
>Reported-by: Wei Yang <richardw.yang@linux.intel.com>
>Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>---
> drivers/nvdimm/btt_devs.c |    5 +++++
> drivers/nvdimm/dax_devs.c |    5 +++++
> drivers/nvdimm/pfn_devs.c |   21 +++++++++++++++++++++
> 3 files changed, 31 insertions(+)
>
>diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
>index 795ad4ff35ca..e0a6f2491e57 100644
>--- a/drivers/nvdimm/btt_devs.c
>+++ b/drivers/nvdimm/btt_devs.c
>@@ -354,6 +354,11 @@ int nd_btt_probe(struct device *dev, struct nd_namespace_common *ndns)
> 		put_device(btt_dev);
> 	}
> 
>+	/*
>+	 * Successful probe indicates to the caller that an nd_btt
>+	 * personality device has been registered and the caller can
>+	 * fail the probe of the baseline namespace device.
>+	 */
> 	return rc;
> }
> EXPORT_SYMBOL(nd_btt_probe);
>diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
>index 0453f49dc708..65010d955fa7 100644
>--- a/drivers/nvdimm/dax_devs.c
>+++ b/drivers/nvdimm/dax_devs.c
>@@ -136,6 +136,11 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
> 	} else
> 		__nd_device_register(dax_dev);
> 
>+	/*
>+	 * Successful probe indicates to the caller that a device-dax
>+	 * personality device has been registered and the caller can
>+	 * fail the probe of the baseline namespace device.
>+	 */
> 	return rc;
> }
> EXPORT_SYMBOL(nd_dax_probe);
>diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
>index 6f22272e8d80..a8783b5a76ba 100644
>--- a/drivers/nvdimm/pfn_devs.c
>+++ b/drivers/nvdimm/pfn_devs.c
>@@ -576,6 +576,11 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
> 	} else
> 		__nd_device_register(pfn_dev);
> 
>+	/*
>+	 * Successful probe indicates to the caller that an nd_pfn
>+	 * personality device has been registered and the caller can
>+	 * fail the probe of the baseline namespace device.
>+	 */
> 	return rc;
> }
> EXPORT_SYMBOL(nd_pfn_probe);
>@@ -706,6 +711,22 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
> 		sig = DAX_SIG;
> 	else
> 		sig = PFN_SIG;
>+
>+	/*
>+	 * Check for an existing 'pfn' superblock before writing a new
>+	 * one. The intended flow is that on the first probe of an
>+	 * nd_{pfn,dax} device the superblock is calculated and written
>+	 * to the namespace. In this case nd_pfn_validate() returns
>+	 * -ENODEV because no valid superblock exists currently.

As you replied in following mail:

3/ If present, nd_pfn_validate() returns 0 and nd_dax_probe()
registers the dax0.1 device (this is a libnvdimm 'personality device).

So at this point, nd_pfn_validate() return 0 or -ENODEV?

>+	 *
>+	 * On subsequent probes nd_pfn_validate() will find a valid
>+	 * superblock and return 0.
>+	 *
>+	 * If an assumption of the superblock has been violated, like a
>+	 * change to the physical alignment of the namespace,
>+	 * nd_pfn_validate() will return an error other than -ENODEV to
>+	 * fail probing.
>+	 */
> 	rc = nd_pfn_validate(nd_pfn, sig);
> 	if (rc != -ENODEV)
> 		return rc;
Dan Williams Jan. 22, 2019, 12:29 a.m. UTC | #7
On Mon, Jan 21, 2019 at 4:26 PM Wei Yang <richardw.yang@linux.intel.com> wrote:
[..]
> >@@ -706,6 +711,22 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
> >               sig = DAX_SIG;
> >       else
> >               sig = PFN_SIG;
> >+
> >+      /*
> >+       * Check for an existing 'pfn' superblock before writing a new
> >+       * one. The intended flow is that on the first probe of an
> >+       * nd_{pfn,dax} device the superblock is calculated and written
> >+       * to the namespace. In this case nd_pfn_validate() returns
> >+       * -ENODEV because no valid superblock exists currently.
>
> As you replied in following mail:
>
> 3/ If present, nd_pfn_validate() returns 0 and nd_dax_probe()
> registers the dax0.1 device (this is a libnvdimm 'personality device).
>
> So at this point, nd_pfn_validate() return 0 or -ENODEV?

In this case 0, because the configuration was successfully validated.

-ENODEV, is only returned for the initial case where we want the
kernel to write the configuration.

All other error codes are an actual failure and the probe procedure stops.
Wei Yang Jan. 22, 2019, 12:29 a.m. UTC | #8
On Mon, Jan 21, 2019 at 03:51:02PM +0800, Wei Yang wrote:
>On Fri, Jan 18, 2019 at 04:47:23PM -0800, Dan Williams wrote:
>>In recent days, 2 engineers, including the original author of
>>nd_pfn_init(), overlooked the internal call to nd_pfn_validate() and the
>>implications to memory allocation.
>>
>>Clarify this situation to help anyone that reads through this code in
>>the future.
>>
>>Reported-by: Wei Yang <richardw.yang@linux.intel.com>
>>Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>>---
>> drivers/nvdimm/btt_devs.c |    5 +++++
>> drivers/nvdimm/dax_devs.c |    5 +++++
>> drivers/nvdimm/pfn_devs.c |   21 +++++++++++++++++++++
>> 3 files changed, 31 insertions(+)
>>
>>diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
>>index 795ad4ff35ca..e0a6f2491e57 100644
>>--- a/drivers/nvdimm/btt_devs.c
>>+++ b/drivers/nvdimm/btt_devs.c
>>@@ -354,6 +354,11 @@ int nd_btt_probe(struct device *dev, struct nd_namespace_common *ndns)
>> 		put_device(btt_dev);
>> 	}
>> 
>>+	/*
>>+	 * Successful probe indicates to the caller that an nd_btt
>>+	 * personality device has been registered and the caller can
>>+	 * fail the probe of the baseline namespace device.
>>+	 */
>> 	return rc;
>> }
>> EXPORT_SYMBOL(nd_btt_probe);
>>diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
>>index 0453f49dc708..65010d955fa7 100644
>>--- a/drivers/nvdimm/dax_devs.c
>>+++ b/drivers/nvdimm/dax_devs.c
>>@@ -136,6 +136,11 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
>> 	} else
>> 		__nd_device_register(dax_dev);
>> 
>>+	/*
>>+	 * Successful probe indicates to the caller that a device-dax
>>+	 * personality device has been registered and the caller can
>>+	 * fail the probe of the baseline namespace device.
>>+	 */
>> 	return rc;
>> }
>> EXPORT_SYMBOL(nd_dax_probe);
>>diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
>>index 6f22272e8d80..a8783b5a76ba 100644
>>--- a/drivers/nvdimm/pfn_devs.c
>>+++ b/drivers/nvdimm/pfn_devs.c
>>@@ -576,6 +576,11 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
>> 	} else
>> 		__nd_device_register(pfn_dev);
>> 
>>+	/*
>>+	 * Successful probe indicates to the caller that an nd_pfn
>>+	 * personality device has been registered and the caller can
>>+	 * fail the probe of the baseline namespace device.
>>+	 */
>> 	return rc;
>> }
>> EXPORT_SYMBOL(nd_pfn_probe);
>>@@ -706,6 +711,22 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>> 		sig = DAX_SIG;
>> 	else
>> 		sig = PFN_SIG;
>>+
>>+	/*
>>+	 * Check for an existing 'pfn' superblock before writing a new
>>+	 * one. The intended flow is that on the first probe of an
>>+	 * nd_{pfn,dax} device the superblock is calculated and written
>>+	 * to the namespace. In this case nd_pfn_validate() returns
>>+	 * -ENODEV because no valid superblock exists currently.
>>+	 *
>>+	 * On subsequent probes nd_pfn_validate() will find a valid
>>+	 * superblock and return 0.
>>+	 *
>>+	 * If an assumption of the superblock has been violated, like a
>>+	 * change to the physical alignment of the namespace,
>>+	 * nd_pfn_validate() will return an error other than -ENODEV to
>>+	 * fail probing.
>>+	 */
>
>Let me reply in this thread. Sorry for my poor understand, I don't get it
>clearly now.
>
>To be honest, the structure is a little bit complicated, if my understanding
>is not correct, please forgive my poor understand.
>
>Below is a code flow. To simply analysis, I setup kernel parameter memmap to
>emulate, and configure one namespace to mode devdax. So that we would have the
>same root for code flow.
>
>Let's start with nd_region_driver:
>
>    nd_region_probe
>        nd_region_register_namespaces
>	    create_namespaces
>	nd_region->btt_seed = nd_btt_create(nd_region);
>	nd_region->pfn_seed = nd_pfn_create(nd_region);
>	nd_region->dax_seed = nd_dax_create(nd_region);

May I ask a question about the purpose to create these three device here?

I see nd_pfn_create() doesn't allocate pfn_sb here, and the probe on these
devices failed. Confused why we need these three devices.
Dan Williams Jan. 22, 2019, 12:45 a.m. UTC | #9
On Mon, Jan 21, 2019 at 4:30 PM Wei Yang <richardw.yang@linux.intel.com> wrote:
>
> On Mon, Jan 21, 2019 at 03:51:02PM +0800, Wei Yang wrote:
> >On Fri, Jan 18, 2019 at 04:47:23PM -0800, Dan Williams wrote:
> >>In recent days, 2 engineers, including the original author of
> >>nd_pfn_init(), overlooked the internal call to nd_pfn_validate() and the
> >>implications to memory allocation.
> >>
> >>Clarify this situation to help anyone that reads through this code in
> >>the future.
> >>
> >>Reported-by: Wei Yang <richardw.yang@linux.intel.com>
> >>Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >>---
> >> drivers/nvdimm/btt_devs.c |    5 +++++
> >> drivers/nvdimm/dax_devs.c |    5 +++++
> >> drivers/nvdimm/pfn_devs.c |   21 +++++++++++++++++++++
> >> 3 files changed, 31 insertions(+)
> >>
> >>diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
> >>index 795ad4ff35ca..e0a6f2491e57 100644
> >>--- a/drivers/nvdimm/btt_devs.c
> >>+++ b/drivers/nvdimm/btt_devs.c
> >>@@ -354,6 +354,11 @@ int nd_btt_probe(struct device *dev, struct nd_namespace_common *ndns)
> >>              put_device(btt_dev);
> >>      }
> >>
> >>+     /*
> >>+      * Successful probe indicates to the caller that an nd_btt
> >>+      * personality device has been registered and the caller can
> >>+      * fail the probe of the baseline namespace device.
> >>+      */
> >>      return rc;
> >> }
> >> EXPORT_SYMBOL(nd_btt_probe);
> >>diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
> >>index 0453f49dc708..65010d955fa7 100644
> >>--- a/drivers/nvdimm/dax_devs.c
> >>+++ b/drivers/nvdimm/dax_devs.c
> >>@@ -136,6 +136,11 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
> >>      } else
> >>              __nd_device_register(dax_dev);
> >>
> >>+     /*
> >>+      * Successful probe indicates to the caller that a device-dax
> >>+      * personality device has been registered and the caller can
> >>+      * fail the probe of the baseline namespace device.
> >>+      */
> >>      return rc;
> >> }
> >> EXPORT_SYMBOL(nd_dax_probe);
> >>diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> >>index 6f22272e8d80..a8783b5a76ba 100644
> >>--- a/drivers/nvdimm/pfn_devs.c
> >>+++ b/drivers/nvdimm/pfn_devs.c
> >>@@ -576,6 +576,11 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
> >>      } else
> >>              __nd_device_register(pfn_dev);
> >>
> >>+     /*
> >>+      * Successful probe indicates to the caller that an nd_pfn
> >>+      * personality device has been registered and the caller can
> >>+      * fail the probe of the baseline namespace device.
> >>+      */
> >>      return rc;
> >> }
> >> EXPORT_SYMBOL(nd_pfn_probe);
> >>@@ -706,6 +711,22 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
> >>              sig = DAX_SIG;
> >>      else
> >>              sig = PFN_SIG;
> >>+
> >>+     /*
> >>+      * Check for an existing 'pfn' superblock before writing a new
> >>+      * one. The intended flow is that on the first probe of an
> >>+      * nd_{pfn,dax} device the superblock is calculated and written
> >>+      * to the namespace. In this case nd_pfn_validate() returns
> >>+      * -ENODEV because no valid superblock exists currently.
> >>+      *
> >>+      * On subsequent probes nd_pfn_validate() will find a valid
> >>+      * superblock and return 0.
> >>+      *
> >>+      * If an assumption of the superblock has been violated, like a
> >>+      * change to the physical alignment of the namespace,
> >>+      * nd_pfn_validate() will return an error other than -ENODEV to
> >>+      * fail probing.
> >>+      */
> >
> >Let me reply in this thread. Sorry for my poor understand, I don't get it
> >clearly now.
> >
> >To be honest, the structure is a little bit complicated, if my understanding
> >is not correct, please forgive my poor understand.
> >
> >Below is a code flow. To simply analysis, I setup kernel parameter memmap to
> >emulate, and configure one namespace to mode devdax. So that we would have the
> >same root for code flow.
> >
> >Let's start with nd_region_driver:
> >
> >    nd_region_probe
> >        nd_region_register_namespaces
> >           create_namespaces
> >       nd_region->btt_seed = nd_btt_create(nd_region);
> >       nd_region->pfn_seed = nd_pfn_create(nd_region);
> >       nd_region->dax_seed = nd_dax_create(nd_region);
>
> May I ask a question about the purpose to create these three device here?
>
> I see nd_pfn_create() doesn't allocate pfn_sb here, and the probe on these
> devices failed. Confused why we need these three devices.

These allow for the namespace creation flow. Namespace creation with
the pfn personality, for example, goes like this:

1/ Find available region capacity

2/ Allocate capacity to a namespace

3/ Assign that namespace to a free pfn instance
echo namespace0.0 > /sys/bus/nd/devices/pfn0.0/namespace

4/ Start the pfn0.0 device
echo pfn0.0 > /sys/bus/nd/drivers/nd_pmem/bind

5/ At this point nd_pfn_validate() returns -ENODEV, so the driver
writes the configuration to the device to be autodetected next time.

This explicit force binding operation is how the driver determines the
difference between the namespace creation case and the namespace
auto-probing case at the next driver load.
Wei Yang Jan. 22, 2019, 12:46 a.m. UTC | #10
On Mon, Jan 21, 2019 at 04:29:08PM -0800, Dan Williams wrote:
>On Mon, Jan 21, 2019 at 4:26 PM Wei Yang <richardw.yang@linux.intel.com> wrote:
>[..]
>> >@@ -706,6 +711,22 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
>> >               sig = DAX_SIG;
>> >       else
>> >               sig = PFN_SIG;
>> >+
>> >+      /*
>> >+       * Check for an existing 'pfn' superblock before writing a new
>> >+       * one. The intended flow is that on the first probe of an
>> >+       * nd_{pfn,dax} device the superblock is calculated and written
>> >+       * to the namespace. In this case nd_pfn_validate() returns
>> >+       * -ENODEV because no valid superblock exists currently.
>>
>> As you replied in following mail:
>>
>> 3/ If present, nd_pfn_validate() returns 0 and nd_dax_probe()
>> registers the dax0.1 device (this is a libnvdimm 'personality device).
>>
>> So at this point, nd_pfn_validate() return 0 or -ENODEV?
>
>In this case 0, because the configuration was successfully validated.
>
>-ENODEV, is only returned for the initial case where we want the
>kernel to write the configuration.
>
>All other error codes are an actual failure and the probe procedure stops.

To be honest, this maybe crystal clear for you. But I still feel a little
confused. Especially on differentiating those cases. How many cases we have? 

And what's your first probe mean? This the nd_btt/pfn/dax_probe()? or the
linux driver probe?
Dan Williams Jan. 22, 2019, 1 a.m. UTC | #11
On Mon, Jan 21, 2019 at 4:47 PM Wei Yang <richardw.yang@linux.intel.com> wrote:
>
> On Mon, Jan 21, 2019 at 04:29:08PM -0800, Dan Williams wrote:
> >On Mon, Jan 21, 2019 at 4:26 PM Wei Yang <richardw.yang@linux.intel.com> wrote:
> >[..]
> >> >@@ -706,6 +711,22 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
> >> >               sig = DAX_SIG;
> >> >       else
> >> >               sig = PFN_SIG;
> >> >+
> >> >+      /*
> >> >+       * Check for an existing 'pfn' superblock before writing a new
> >> >+       * one. The intended flow is that on the first probe of an
> >> >+       * nd_{pfn,dax} device the superblock is calculated and written
> >> >+       * to the namespace. In this case nd_pfn_validate() returns
> >> >+       * -ENODEV because no valid superblock exists currently.
> >>
> >> As you replied in following mail:
> >>
> >> 3/ If present, nd_pfn_validate() returns 0 and nd_dax_probe()
> >> registers the dax0.1 device (this is a libnvdimm 'personality device).
> >>
> >> So at this point, nd_pfn_validate() return 0 or -ENODEV?
> >
> >In this case 0, because the configuration was successfully validated.
> >
> >-ENODEV, is only returned for the initial case where we want the
> >kernel to write the configuration.
> >
> >All other error codes are an actual failure and the probe procedure stops.
>
> To be honest, this maybe crystal clear for you. But I still feel a little
> confused. Especially on differentiating those cases. How many cases we have?

There are 4 cases:

1/ namespace probed, no personality info-block found, namespace
results in 'raw' mode

2/ namespace probed, {pfn,btt,dax} info-block found, {pfn,btt,dax}
device created and attached to the corresponding personality driver

3/ namespace probing ends in failure like -ENOMEM, or a -EIO error
reading namespace capacity, for example due to a bad block.

4/ namespace force assigned to a {pfn,btt,dax} device instance,
{pfn,btt,dax} device force enabled  leading to the info-block being
written out to the device

> And what's your first probe mean? This the nd_btt/pfn/dax_probe()? or the
> linux driver probe?

The first probe is always nd_pmem_probe(). nd_{pfn,btt,dax}_probe()
are secondary.
Wei Yang Jan. 22, 2019, 3:13 a.m. UTC | #12
On Mon, Jan 21, 2019 at 04:45:46PM -0800, Dan Williams wrote:
>On Mon, Jan 21, 2019 at 4:30 PM Wei Yang <richardw.yang@linux.intel.com> wrote:
>>
>>
>> May I ask a question about the purpose to create these three device here?
>>
>> I see nd_pfn_create() doesn't allocate pfn_sb here, and the probe on these
>> devices failed. Confused why we need these three devices.
>
>These allow for the namespace creation flow. Namespace creation with
>the pfn personality, for example, goes like this:
>
>1/ Find available region capacity
>
>2/ Allocate capacity to a namespace
>
>3/ Assign that namespace to a free pfn instance
>echo namespace0.0 > /sys/bus/nd/devices/pfn0.0/namespace
>

Here we use __nd_attach_ndns() connect pfn0.0 with namespace0.0

>4/ Start the pfn0.0 device
>echo pfn0.0 > /sys/bus/nd/drivers/nd_pmem/bind
>

Here we trigger nd_pmem_probe() and call pmem_attach_disk() for device pfn0.0,
because this is a pfn device.

>5/ At this point nd_pfn_validate() returns -ENODEV, so the driver
>writes the configuration to the device to be autodetected next time.
>

Then it go through pmem_attach_disk()->nvdimm_setup_pfn()->nd_pfn_init().

The return value -ENODEV leads nd_pfn_init() to configuration write.

Everything looks good here, but I am lost at "Autodetected next time". Because
the driver will continue to finish its work and attach pfn0.0.

Do we need another autodetect? Or you mean after driver detach or reboot? So
nd_pfn_validate() will not return -ENODEV, as it has already configured well.

>This explicit force binding operation is how the driver determines the
>difference between the namespace creation case and the namespace
>auto-probing case at the next driver load.
Wei Yang Jan. 22, 2019, 3:35 a.m. UTC | #13
On Fri, Jan 18, 2019 at 04:47:23PM -0800, Dan Williams wrote:
>In recent days, 2 engineers, including the original author of
>nd_pfn_init(), overlooked the internal call to nd_pfn_validate() and the
>implications to memory allocation.
>
>Clarify this situation to help anyone that reads through this code in
>the future.
>
>Reported-by: Wei Yang <richardw.yang@linux.intel.com>
>Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>---
> drivers/nvdimm/btt_devs.c |    5 +++++
> drivers/nvdimm/dax_devs.c |    5 +++++
> drivers/nvdimm/pfn_devs.c |   21 +++++++++++++++++++++
> 3 files changed, 31 insertions(+)
>
>diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
>index 795ad4ff35ca..e0a6f2491e57 100644
>--- a/drivers/nvdimm/btt_devs.c
>+++ b/drivers/nvdimm/btt_devs.c
>@@ -354,6 +354,11 @@ int nd_btt_probe(struct device *dev, struct nd_namespace_common *ndns)
> 		put_device(btt_dev);
> 	}
> 
>+	/*
>+	 * Successful probe indicates to the caller that an nd_btt
>+	 * personality device has been registered and the caller can
>+	 * fail the probe of the baseline namespace device.
>+	 */
> 	return rc;
> }
> EXPORT_SYMBOL(nd_btt_probe);
>diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
>index 0453f49dc708..65010d955fa7 100644
>--- a/drivers/nvdimm/dax_devs.c
>+++ b/drivers/nvdimm/dax_devs.c
>@@ -136,6 +136,11 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
> 	} else
> 		__nd_device_register(dax_dev);
> 
>+	/*
>+	 * Successful probe indicates to the caller that a device-dax
>+	 * personality device has been registered and the caller can
>+	 * fail the probe of the baseline namespace device.
>+	 */
> 	return rc;
> }
> EXPORT_SYMBOL(nd_dax_probe);
>diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
>index 6f22272e8d80..a8783b5a76ba 100644
>--- a/drivers/nvdimm/pfn_devs.c
>+++ b/drivers/nvdimm/pfn_devs.c
>@@ -576,6 +576,11 @@ int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
> 	} else
> 		__nd_device_register(pfn_dev);
> 
>+	/*
>+	 * Successful probe indicates to the caller that an nd_pfn
>+	 * personality device has been registered and the caller can
>+	 * fail the probe of the baseline namespace device.
>+	 */
> 	return rc;
> }
> EXPORT_SYMBOL(nd_pfn_probe);
>@@ -706,6 +711,22 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
> 		sig = DAX_SIG;
> 	else
> 		sig = PFN_SIG;
>+
>+	/*
>+	 * Check for an existing 'pfn' superblock before writing a new
>+	 * one. The intended flow is that on the first probe of an
>+	 * nd_{pfn,dax} device the superblock is calculated and written
>+	 * to the namespace. In this case nd_pfn_validate() returns
>+	 * -ENODEV because no valid superblock exists currently.
>+	 *
>+	 * On subsequent probes nd_pfn_validate() will find a valid
>+	 * superblock and return 0.
>+	 *
>+	 * If an assumption of the superblock has been violated, like a
>+	 * change to the physical alignment of the namespace,
>+	 * nd_pfn_validate() will return an error other than -ENODEV to
>+	 * fail probing.
>+	 */

How about adjust this a little like:

    Check for an existing 'pfn' superblock before writing a new one.
    
    Return:
    
      -ENODEV: no valid superblock exists 
      0      : valid superblock exists
      other  : superblock violation, e.g. physical alignment change
    
    One superblock should be configured, before an nd_{pfn,dax} device be
    used properly.
    
    One newly create nd_{pfn,dax} device has no valid superblock. In this case
    nd_pfn_validate() returns -ENODEV to make driver continue and write
    configuration to superblock. 
    
    After proper configuration the first time, subsequent nd_pfn_validate()
    will find a valid superblock and return 0. So that driver will return
    immediately without configuring superblock again.
    
    An error other than -ENODEV means superblock violation, which fail
    probing.

> 	rc = nd_pfn_validate(nd_pfn, sig);
> 	if (rc != -ENODEV)
> 		return rc;

Patch
diff mbox series

diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index 795ad4ff35ca..e0a6f2491e57 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -354,6 +354,11 @@  int nd_btt_probe(struct device *dev, struct nd_namespace_common *ndns)
 		put_device(btt_dev);
 	}
 
+	/*
+	 * Successful probe indicates to the caller that an nd_btt
+	 * personality device has been registered and the caller can
+	 * fail the probe of the baseline namespace device.
+	 */
 	return rc;
 }
 EXPORT_SYMBOL(nd_btt_probe);
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
index 0453f49dc708..65010d955fa7 100644
--- a/drivers/nvdimm/dax_devs.c
+++ b/drivers/nvdimm/dax_devs.c
@@ -136,6 +136,11 @@  int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
 	} else
 		__nd_device_register(dax_dev);
 
+	/*
+	 * Successful probe indicates to the caller that a device-dax
+	 * personality device has been registered and the caller can
+	 * fail the probe of the baseline namespace device.
+	 */
 	return rc;
 }
 EXPORT_SYMBOL(nd_dax_probe);
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 6f22272e8d80..a8783b5a76ba 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -576,6 +576,11 @@  int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
 	} else
 		__nd_device_register(pfn_dev);
 
+	/*
+	 * Successful probe indicates to the caller that an nd_pfn
+	 * personality device has been registered and the caller can
+	 * fail the probe of the baseline namespace device.
+	 */
 	return rc;
 }
 EXPORT_SYMBOL(nd_pfn_probe);
@@ -706,6 +711,22 @@  static int nd_pfn_init(struct nd_pfn *nd_pfn)
 		sig = DAX_SIG;
 	else
 		sig = PFN_SIG;
+
+	/*
+	 * Check for an existing 'pfn' superblock before writing a new
+	 * one. The intended flow is that on the first probe of an
+	 * nd_{pfn,dax} device the superblock is calculated and written
+	 * to the namespace. In this case nd_pfn_validate() returns
+	 * -ENODEV because no valid superblock exists currently.
+	 *
+	 * On subsequent probes nd_pfn_validate() will find a valid
+	 * superblock and return 0.
+	 *
+	 * If an assumption of the superblock has been violated, like a
+	 * change to the physical alignment of the namespace,
+	 * nd_pfn_validate() will return an error other than -ENODEV to
+	 * fail probing.
+	 */
 	rc = nd_pfn_validate(nd_pfn, sig);
 	if (rc != -ENODEV)
 		return rc;