diff mbox series

[V1,1/2] libnvdimm/nsio: differentiate between probe mapping and runtime mapping

Message ID 20191015153302.15750-1-aneesh.kumar@linux.ibm.com (mailing list archive)
State Superseded
Headers show
Series [V1,1/2] libnvdimm/nsio: differentiate between probe mapping and runtime mapping | expand

Commit Message

Aneesh Kumar K.V Oct. 15, 2019, 3:33 p.m. UTC
nvdimm core currently maps the full namespace to an ioremap range
while probing the namespace mode. This can result in probe failures
on architectures that have limited ioremap space.

nvdimm core can avoid this failure by only mapping the reserver block area to
check for pfn superblock type and map the full namespace resource only before
using the namespace. nvdimm core use ioremap range only for the raw and btt
namespace and we can limit the max namespace size for these two modes. For
both fsdax and devdax this change enables nvdimm to map namespace larger
that ioremap limit.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/nvdimm/blk.c      |  2 +-
 drivers/nvdimm/btt.c      | 13 ++++++++++++-
 drivers/nvdimm/claim.c    |  2 +-
 drivers/nvdimm/nd.h       |  2 +-
 drivers/nvdimm/pfn.h      |  6 ++++++
 drivers/nvdimm/pfn_devs.c |  5 -----
 drivers/nvdimm/pmem.c     |  2 +-
 7 files changed, 22 insertions(+), 10 deletions(-)

Comments

Dan Williams Oct. 15, 2019, 10:02 p.m. UTC | #1
On Tue, Oct 15, 2019 at 8:33 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> nvdimm core currently maps the full namespace to an ioremap range
> while probing the namespace mode. This can result in probe failures
> on architectures that have limited ioremap space.

Is there a #define for that limit?

> nvdimm core can avoid this failure by only mapping the reserver block area to
> check for pfn superblock type and map the full namespace resource only before
> using the namespace. nvdimm core use ioremap range only for the raw and btt
> namespace and we can limit the max namespace size for these two modes. For
> both fsdax and devdax this change enables nvdimm to map namespace larger
> that ioremap limit.

If the direct map has more space I think it would be better to add a
way to use that to map for all namespaces rather than introduce
arbitrary failures based on the mode.

I would buy a performance argument to avoid overmapping, but for
namespace access compatibility where an alternate mapping method would
succeed I think we should aim for that to be used instead. Thoughts?
Aneesh Kumar K.V Oct. 16, 2019, 5:29 a.m. UTC | #2
On 10/16/19 3:32 AM, Dan Williams wrote:
> On Tue, Oct 15, 2019 at 8:33 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> nvdimm core currently maps the full namespace to an ioremap range
>> while probing the namespace mode. This can result in probe failures
>> on architectures that have limited ioremap space.
> 
> Is there a #define for that limit?
> 

Arch specific #define. For example. ppc64 have different limits based on 
platform and translation mode. Hash translation with 4k PAGE_SIZE limit 
ioremap range to 8TB.

>> nvdimm core can avoid this failure by only mapping the reserver block area to
>> check for pfn superblock type and map the full namespace resource only before
>> using the namespace. nvdimm core use ioremap range only for the raw and btt
>> namespace and we can limit the max namespace size for these two modes. For
>> both fsdax and devdax this change enables nvdimm to map namespace larger
>> that ioremap limit.
> 
> If the direct map has more space I think it would be better to add a
> way to use that to map for all namespaces rather than introduce
> arbitrary failures based on the mode.
> 
> I would buy a performance argument to avoid overmapping, but for
> namespace access compatibility where an alternate mapping method would
> succeed I think we should aim for that to be used instead. Thoughts?
> 

That would require to have struct page allocated for these range and 
both raw and btt don't need a struct page backing?

-aneesh
Dan Williams Oct. 16, 2019, 4:04 p.m. UTC | #3
On Tue, Oct 15, 2019 at 10:29 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 10/16/19 3:32 AM, Dan Williams wrote:
> > On Tue, Oct 15, 2019 at 8:33 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> nvdimm core currently maps the full namespace to an ioremap range
> >> while probing the namespace mode. This can result in probe failures
> >> on architectures that have limited ioremap space.
> >
> > Is there a #define for that limit?
> >
>
> Arch specific #define. For example. ppc64 have different limits based on
> platform and translation mode. Hash translation with 4k PAGE_SIZE limit
> ioremap range to 8TB.
>
> >> nvdimm core can avoid this failure by only mapping the reserver block area to
> >> check for pfn superblock type and map the full namespace resource only before
> >> using the namespace. nvdimm core use ioremap range only for the raw and btt
> >> namespace and we can limit the max namespace size for these two modes. For
> >> both fsdax and devdax this change enables nvdimm to map namespace larger
> >> that ioremap limit.
> >
> > If the direct map has more space I think it would be better to add a
> > way to use that to map for all namespaces rather than introduce
> > arbitrary failures based on the mode.
> >
> > I would buy a performance argument to avoid overmapping, but for
> > namespace access compatibility where an alternate mapping method would
> > succeed I think we should aim for that to be used instead. Thoughts?
> >
>
> That would require to have struct page allocated for these range and
> both raw and btt don't need a struct page backing?
>

I was thinking a new mapping interface that just consumed direct-map
space, but did not allocate pages.
Aneesh Kumar K.V Oct. 16, 2019, 4:58 p.m. UTC | #4
On 10/16/19 9:34 PM, Dan Williams wrote:
> On Tue, Oct 15, 2019 at 10:29 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> On 10/16/19 3:32 AM, Dan Williams wrote:
>>> On Tue, Oct 15, 2019 at 8:33 AM Aneesh Kumar K.V
>>> <aneesh.kumar@linux.ibm.com> wrote:
>>>>
>>>> nvdimm core currently maps the full namespace to an ioremap range
>>>> while probing the namespace mode. This can result in probe failures
>>>> on architectures that have limited ioremap space.
>>>
>>> Is there a #define for that limit?
>>>
>>
>> Arch specific #define. For example. ppc64 have different limits based on
>> platform and translation mode. Hash translation with 4k PAGE_SIZE limit
>> ioremap range to 8TB.
>>
>>>> nvdimm core can avoid this failure by only mapping the reserver block area to
>>>> check for pfn superblock type and map the full namespace resource only before
>>>> using the namespace. nvdimm core use ioremap range only for the raw and btt
>>>> namespace and we can limit the max namespace size for these two modes. For
>>>> both fsdax and devdax this change enables nvdimm to map namespace larger
>>>> that ioremap limit.
>>>
>>> If the direct map has more space I think it would be better to add a
>>> way to use that to map for all namespaces rather than introduce
>>> arbitrary failures based on the mode.
>>>
>>> I would buy a performance argument to avoid overmapping, but for
>>> namespace access compatibility where an alternate mapping method would
>>> succeed I think we should aim for that to be used instead. Thoughts?
>>>
>>
>> That would require to have struct page allocated for these range and
>> both raw and btt don't need a struct page backing?
>>
> 
> I was thinking a new mapping interface that just consumed direct-map
> space, but did not allocate pages.
> 

Not sure how easy that would be. We are looking at having part of 
direct-map address not managed by any zone and then possibly archs need 
to be taught to handle these ? (for example for ppc64 we "bolt" direct 
map range where as we allow taking low level hash fault for I/O remap range)

Even though you don't consider the patch as complete, considering the 
approach you outlined would require larger changes, do you think this 
patch can be accepted as a bug fix? Right now we can fail namespace
initialization during boot or ndctl enable-namespace all.

For example with ppc64 and I/O remap range limit of 8TB, we can 
individually create a 6TB namespace. We also allow to create multiple 
such namespaces. But if we try to enable them all together using ndctl 
enable-namespace all, that will fail with error

[   54.259910] vmap allocation for size x failed: use vmalloc=<size> to 
increase size

because we probe these namespaces in parallel.

-aneesh
Dan Williams Oct. 16, 2019, 7:05 p.m. UTC | #5
On Wed, Oct 16, 2019 at 9:59 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 10/16/19 9:34 PM, Dan Williams wrote:
> > On Tue, Oct 15, 2019 at 10:29 PM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> On 10/16/19 3:32 AM, Dan Williams wrote:
> >>> On Tue, Oct 15, 2019 at 8:33 AM Aneesh Kumar K.V
> >>> <aneesh.kumar@linux.ibm.com> wrote:
> >>>>
> >>>> nvdimm core currently maps the full namespace to an ioremap range
> >>>> while probing the namespace mode. This can result in probe failures
> >>>> on architectures that have limited ioremap space.
> >>>
> >>> Is there a #define for that limit?
> >>>
> >>
> >> Arch specific #define. For example. ppc64 have different limits based on
> >> platform and translation mode. Hash translation with 4k PAGE_SIZE limit
> >> ioremap range to 8TB.
> >>
> >>>> nvdimm core can avoid this failure by only mapping the reserver block area to
> >>>> check for pfn superblock type and map the full namespace resource only before
> >>>> using the namespace. nvdimm core use ioremap range only for the raw and btt
> >>>> namespace and we can limit the max namespace size for these two modes. For
> >>>> both fsdax and devdax this change enables nvdimm to map namespace larger
> >>>> that ioremap limit.
> >>>
> >>> If the direct map has more space I think it would be better to add a
> >>> way to use that to map for all namespaces rather than introduce
> >>> arbitrary failures based on the mode.
> >>>
> >>> I would buy a performance argument to avoid overmapping, but for
> >>> namespace access compatibility where an alternate mapping method would
> >>> succeed I think we should aim for that to be used instead. Thoughts?
> >>>
> >>
> >> That would require to have struct page allocated for these range and
> >> both raw and btt don't need a struct page backing?
> >>
> >
> > I was thinking a new mapping interface that just consumed direct-map
> > space, but did not allocate pages.
> >
>
> Not sure how easy that would be. We are looking at having part of
> direct-map address not managed by any zone and then possibly archs need
> to be taught to handle these ? (for example for ppc64 we "bolt" direct
> map range where as we allow taking low level hash fault for I/O remap range)
>
> Even though you don't consider the patch as complete, considering the
> approach you outlined would require larger changes, do you think this
> patch can be accepted as a bug fix? Right now we can fail namespace
> initialization during boot or ndctl enable-namespace all.
>
> For example with ppc64 and I/O remap range limit of 8TB, we can
> individually create a 6TB namespace. We also allow to create multiple
> such namespaces. But if we try to enable them all together using ndctl
> enable-namespace all, that will fail with error
>
> [   54.259910] vmap allocation for size x failed: use vmalloc=<size> to
> increase size
>
> because we probe these namespaces in parallel.

The patch is incomplete, right? It does not fix the raw mode namespace
case, and that error message seems to indicate to the user how to fix
the problem. I was of the impression it was a fixed range in the
address map. Could you instead try to autodetect the potential pmem
usage and auto increase the vmap space?
Aneesh Kumar K.V Oct. 17, 2019, 2:43 a.m. UTC | #6
On 10/17/19 12:35 AM, Dan Williams wrote:
> On Wed, Oct 16, 2019 at 9:59 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> On 10/16/19 9:34 PM, Dan Williams wrote:
>>> On Tue, Oct 15, 2019 at 10:29 PM Aneesh Kumar K.V
>>> <aneesh.kumar@linux.ibm.com> wrote:
>>>>
>>>> On 10/16/19 3:32 AM, Dan Williams wrote:
>>>>> On Tue, Oct 15, 2019 at 8:33 AM Aneesh Kumar K.V
>>>>> <aneesh.kumar@linux.ibm.com> wrote:
>>>>>>
>>>>>> nvdimm core currently maps the full namespace to an ioremap range
>>>>>> while probing the namespace mode. This can result in probe failures
>>>>>> on architectures that have limited ioremap space.
>>>>>
>>>>> Is there a #define for that limit?
>>>>>
>>>>
>>>> Arch specific #define. For example. ppc64 have different limits based on
>>>> platform and translation mode. Hash translation with 4k PAGE_SIZE limit
>>>> ioremap range to 8TB.
>>>>
>>>>>> nvdimm core can avoid this failure by only mapping the reserver block area to
>>>>>> check for pfn superblock type and map the full namespace resource only before
>>>>>> using the namespace. nvdimm core use ioremap range only for the raw and btt
>>>>>> namespace and we can limit the max namespace size for these two modes. For
>>>>>> both fsdax and devdax this change enables nvdimm to map namespace larger
>>>>>> that ioremap limit.
>>>>>
>>>>> If the direct map has more space I think it would be better to add a
>>>>> way to use that to map for all namespaces rather than introduce
>>>>> arbitrary failures based on the mode.
>>>>>
>>>>> I would buy a performance argument to avoid overmapping, but for
>>>>> namespace access compatibility where an alternate mapping method would
>>>>> succeed I think we should aim for that to be used instead. Thoughts?
>>>>>
>>>>
>>>> That would require to have struct page allocated for these range and
>>>> both raw and btt don't need a struct page backing?
>>>>
>>>
>>> I was thinking a new mapping interface that just consumed direct-map
>>> space, but did not allocate pages.
>>>
>>
>> Not sure how easy that would be. We are looking at having part of
>> direct-map address not managed by any zone and then possibly archs need
>> to be taught to handle these ? (for example for ppc64 we "bolt" direct
>> map range where as we allow taking low level hash fault for I/O remap range)
>>
>> Even though you don't consider the patch as complete, considering the
>> approach you outlined would require larger changes, do you think this
>> patch can be accepted as a bug fix? Right now we can fail namespace
>> initialization during boot or ndctl enable-namespace all.
>>
>> For example with ppc64 and I/O remap range limit of 8TB, we can
>> individually create a 6TB namespace. We also allow to create multiple
>> such namespaces. But if we try to enable them all together using ndctl
>> enable-namespace all, that will fail with error
>>
>> [   54.259910] vmap allocation for size x failed: use vmalloc=<size> to
>> increase size
>>
>> because we probe these namespaces in parallel.
> 
> The patch is incomplete, right?

Incomplete with respect to the detail that we still don't allow large 
raw and btt namespaces.


> It does not fix the raw mode namespace
> case, and that error message seems to indicate to the user how to fix
> the problem. I was of the impression it was a fixed range in the
> address map. Could you instead try to autodetect the potential pmem
> usage and auto increase the vmap space?
> 
The error is printed by generic code and the failures are due to fixed 
size. We can't workaround that using vmalloc=<size> option.

-aneesh
Dan Williams Oct. 17, 2019, 3:04 a.m. UTC | #7
On Wed, Oct 16, 2019 at 7:43 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 10/17/19 12:35 AM, Dan Williams wrote:
> > On Wed, Oct 16, 2019 at 9:59 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> On 10/16/19 9:34 PM, Dan Williams wrote:
> >>> On Tue, Oct 15, 2019 at 10:29 PM Aneesh Kumar K.V
> >>> <aneesh.kumar@linux.ibm.com> wrote:
> >>>>
> >>>> On 10/16/19 3:32 AM, Dan Williams wrote:
> >>>>> On Tue, Oct 15, 2019 at 8:33 AM Aneesh Kumar K.V
> >>>>> <aneesh.kumar@linux.ibm.com> wrote:
> >>>>>>
> >>>>>> nvdimm core currently maps the full namespace to an ioremap range
> >>>>>> while probing the namespace mode. This can result in probe failures
> >>>>>> on architectures that have limited ioremap space.
> >>>>>
> >>>>> Is there a #define for that limit?
> >>>>>
> >>>>
> >>>> Arch specific #define. For example. ppc64 have different limits based on
> >>>> platform and translation mode. Hash translation with 4k PAGE_SIZE limit
> >>>> ioremap range to 8TB.
> >>>>
> >>>>>> nvdimm core can avoid this failure by only mapping the reserver block area to
> >>>>>> check for pfn superblock type and map the full namespace resource only before
> >>>>>> using the namespace. nvdimm core use ioremap range only for the raw and btt
> >>>>>> namespace and we can limit the max namespace size for these two modes. For
> >>>>>> both fsdax and devdax this change enables nvdimm to map namespace larger
> >>>>>> that ioremap limit.
> >>>>>
> >>>>> If the direct map has more space I think it would be better to add a
> >>>>> way to use that to map for all namespaces rather than introduce
> >>>>> arbitrary failures based on the mode.
> >>>>>
> >>>>> I would buy a performance argument to avoid overmapping, but for
> >>>>> namespace access compatibility where an alternate mapping method would
> >>>>> succeed I think we should aim for that to be used instead. Thoughts?
> >>>>>
> >>>>
> >>>> That would require to have struct page allocated for these range and
> >>>> both raw and btt don't need a struct page backing?
> >>>>
> >>>
> >>> I was thinking a new mapping interface that just consumed direct-map
> >>> space, but did not allocate pages.
> >>>
> >>
> >> Not sure how easy that would be. We are looking at having part of
> >> direct-map address not managed by any zone and then possibly archs need
> >> to be taught to handle these ? (for example for ppc64 we "bolt" direct
> >> map range where as we allow taking low level hash fault for I/O remap range)
> >>
> >> Even though you don't consider the patch as complete, considering the
> >> approach you outlined would require larger changes, do you think this
> >> patch can be accepted as a bug fix? Right now we can fail namespace
> >> initialization during boot or ndctl enable-namespace all.
> >>
> >> For example with ppc64 and I/O remap range limit of 8TB, we can
> >> individually create a 6TB namespace. We also allow to create multiple
> >> such namespaces. But if we try to enable them all together using ndctl
> >> enable-namespace all, that will fail with error
> >>
> >> [   54.259910] vmap allocation for size x failed: use vmalloc=<size> to
> >> increase size
> >>
> >> because we probe these namespaces in parallel.
> >
> > The patch is incomplete, right?
>
> Incomplete with respect to the detail that we still don't allow large
> raw and btt namespaces.
>
>
> > It does not fix the raw mode namespace
> > case, and that error message seems to indicate to the user how to fix
> > the problem. I was of the impression it was a fixed range in the
> > address map. Could you instead try to autodetect the potential pmem
> > usage and auto increase the vmap space?
> >
> The error is printed by generic code and the failures are due to fixed
> size. We can't workaround that using vmalloc=<size> option.

Darn.

Ok, explain to me again how this patch helps. This just seems to delay
the inevitable failure a bit, but the end result is that the user
needs to pick and choose which namespaces to enable after the kernel
has tried to auto-probe namespaces.
Aneesh Kumar K.V Oct. 17, 2019, 3:18 a.m. UTC | #8
On 10/17/19 8:34 AM, Dan Williams wrote:
> On Wed, Oct 16, 2019 at 7:43 PM Aneesh Kumar K.V


....

>>>
>> The error is printed by generic code and the failures are due to fixed
>> size. We can't workaround that using vmalloc=<size> option.
> 
> Darn.
> 
> Ok, explain to me again how this patch helps. This just seems to delay
> the inevitable failure a bit, but the end result is that the user
> needs to pick and choose which namespaces to enable after the kernel
> has tried to auto-probe namespaces.
>

Right now we map the entire namespace using I/O remap range while 
probing and if we find the namespace mode to be fsdax or devdax we unmap 
the namespace and map them back using direct-map range. This temporary 
mapping cause transient failures if two large namespaces were probed at 
the same time. We try to map the second names space in the I/O remap 
range while the first one is already mapped there. ie,

If we have two 6TB namespaces with an I/O remap range limit of 8TB. We 
can individually create these namespaces and enable/disable them because 
they are within the 8TB limit. But if we try to probe them together,
and if namespace1 is already mapped in the I/O remap range we have only 
2TB space left in the I/O remap range and trying to probe the second 
namespace at the same time fails due to lack of space to map the 
namespace during probing.

This is not an issue with raw/btt, because they keep the namespace 
always mapped in the I/O remap range and if we are able to create a 
namespace we can enable them together. (we do have the issue of creating 
large raw/btt namespace while other raw/btt namespaces are disabled and 
trying to enable them all together).

The fix proposed in this patch is to not map the full namespace range 
while probing. We just need to map the reserve block to find the 
namespace mode. Based on the detected namespace mode, we either use 
direct-map range or I/O range to map the full namespace.

-aneesh
Dan Williams Oct. 17, 2019, 4:12 a.m. UTC | #9
On Wed, Oct 16, 2019 at 8:19 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 10/17/19 8:34 AM, Dan Williams wrote:
> > On Wed, Oct 16, 2019 at 7:43 PM Aneesh Kumar K.V
>
>
> ....
>
> >>>
> >> The error is printed by generic code and the failures are due to fixed
> >> size. We can't workaround that using vmalloc=<size> option.
> >
> > Darn.
> >
> > Ok, explain to me again how this patch helps. This just seems to delay
> > the inevitable failure a bit, but the end result is that the user
> > needs to pick and choose which namespaces to enable after the kernel
> > has tried to auto-probe namespaces.
> >
>
> Right now we map the entire namespace using I/O remap range while
> probing and if we find the namespace mode to be fsdax or devdax we unmap
> the namespace and map them back using direct-map range. This temporary
> mapping cause transient failures if two large namespaces were probed at
> the same time. We try to map the second names space in the I/O remap
> range while the first one is already mapped there. ie,
>
> If we have two 6TB namespaces with an I/O remap range limit of 8TB. We
> can individually create these namespaces and enable/disable them because
> they are within the 8TB limit. But if we try to probe them together,
> and if namespace1 is already mapped in the I/O remap range we have only
> 2TB space left in the I/O remap range and trying to probe the second
> namespace at the same time fails due to lack of space to map the
> namespace during probing.
>
> This is not an issue with raw/btt, because they keep the namespace
> always mapped in the I/O remap range and if we are able to create a
> namespace we can enable them together. (we do have the issue of creating
> large raw/btt namespace while other raw/btt namespaces are disabled and
> trying to enable them all together).
>
> The fix proposed in this patch is to not map the full namespace range
> while probing. We just need to map the reserve block to find the
> namespace mode. Based on the detected namespace mode, we either use
> direct-map range or I/O range to map the full namespace.
>

Ah, ok, you're not trying to address the case where raw and btt
namespaces don't fit you're trying to fix the case where raw and btt
namespaces *do* fit in the vmap, but pfn / fsdax namespaces exceed the
vmap.

I would highlight that in the changelog, namely that an otherwise
working configuration with two namespaces (one btt and one pfn), where
the btt namespace consumes almost all the vmap space, will fail to
enable due to needing too much temporary vmap space to initialize the
pfn mode namespace.

Let's also make this a bit more self contained to devm_nsio_enable()
and add a length parameter. Then when btt is ready to start accessing
more than the initial 8K it can be just be called again with the full
namespace size. Then devm_nsio_enable can notice when the mapping
needs to be expanded vs established.

Although btt should not be making assumptions about the underlying
namepsace type so it would need to call a new devm_ndns_enable()
wrapper that calls devm_nsio_enable() for nsio namespaces and is a nop
for ndblk namespaces.
Aneesh Kumar K.V Oct. 17, 2019, 7:10 a.m. UTC | #10
Dan Williams <dan.j.williams@intel.com> writes:

> On Wed, Oct 16, 2019 at 8:19 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> On 10/17/19 8:34 AM, Dan Williams wrote:
>> > On Wed, Oct 16, 2019 at 7:43 PM Aneesh Kumar K.V
>>
>>
>> ....
>>
>> >>>
>> >> The error is printed by generic code and the failures are due to fixed
>> >> size. We can't workaround that using vmalloc=<size> option.
>> >
>> > Darn.
>> >
>> > Ok, explain to me again how this patch helps. This just seems to delay
>> > the inevitable failure a bit, but the end result is that the user
>> > needs to pick and choose which namespaces to enable after the kernel
>> > has tried to auto-probe namespaces.
>> >
>>
>> Right now we map the entire namespace using I/O remap range while
>> probing and if we find the namespace mode to be fsdax or devdax we unmap
>> the namespace and map them back using direct-map range. This temporary
>> mapping cause transient failures if two large namespaces were probed at
>> the same time. We try to map the second names space in the I/O remap
>> range while the first one is already mapped there. ie,
>>
>> If we have two 6TB namespaces with an I/O remap range limit of 8TB. We
>> can individually create these namespaces and enable/disable them because
>> they are within the 8TB limit. But if we try to probe them together,
>> and if namespace1 is already mapped in the I/O remap range we have only
>> 2TB space left in the I/O remap range and trying to probe the second
>> namespace at the same time fails due to lack of space to map the
>> namespace during probing.
>>
>> This is not an issue with raw/btt, because they keep the namespace
>> always mapped in the I/O remap range and if we are able to create a
>> namespace we can enable them together. (we do have the issue of creating
>> large raw/btt namespace while other raw/btt namespaces are disabled and
>> trying to enable them all together).
>>
>> The fix proposed in this patch is to not map the full namespace range
>> while probing. We just need to map the reserve block to find the
>> namespace mode. Based on the detected namespace mode, we either use
>> direct-map range or I/O range to map the full namespace.
>>
>
> Ah, ok, you're not trying to address the case where raw and btt
> namespaces don't fit you're trying to fix the case where raw and btt
> namespaces *do* fit in the vmap, but pfn / fsdax namespaces exceed the
> vmap.
>
> I would highlight that in the changelog, namely that an otherwise
> working configuration with two namespaces (one btt and one pfn), where
> the btt namespace consumes almost all the vmap space, will fail to
> enable due to needing too much temporary vmap space to initialize the
> pfn mode namespace.
>
> Let's also make this a bit more self contained to devm_nsio_enable()
> and add a length parameter. Then when btt is ready to start accessing
> more than the initial 8K it can be just be called again with the full
> namespace size. Then devm_nsio_enable can notice when the mapping
> needs to be expanded vs established.
>
> Although btt should not be making assumptions about the underlying
> namepsace type so it would need to call a new devm_ndns_enable()
> wrapper that calls devm_nsio_enable() for nsio namespaces and is a nop
> for ndblk namespaces.

How about the below?

modified   drivers/nvdimm/pmem.c
@@ -491,17 +491,26 @@ static int pmem_attach_disk(struct device *dev,
 static int nd_pmem_probe(struct device *dev)
 {
 	int ret;
+	struct nd_namespace_io *nsio;
 	struct nd_namespace_common *ndns;
 
 	ndns = nvdimm_namespace_common_probe(dev);
 	if (IS_ERR(ndns))
 		return PTR_ERR(ndns);
 
-	if (devm_nsio_enable(dev, to_nd_namespace_io(&ndns->dev)))
-		return -ENXIO;
+	nsio = to_nd_namespace_io(&ndns->dev);
 
-	if (is_nd_btt(dev))
+	if (is_nd_btt(dev)) {
+		/*
+		 * Map with resource size
+		 */
+		if (devm_nsio_enable(dev, nsio, resource_size(&nsio->res)))
+			return -ENXIO;
 		return nvdimm_namespace_attach_btt(ndns);
+	}
+
+	if (devm_nsio_enable(dev, nsio, info_block_reserve()))
+		return -ENXIO;
 
 	if (is_nd_pfn(dev))
 		return pmem_attach_disk(dev, ndns);

-aneesh
diff mbox series

Patch

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 677d6f45b5c4..755192332269 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -302,7 +302,7 @@  static int nd_blk_probe(struct device *dev)
 
 	ndns->rw_bytes = nsblk_rw_bytes;
 	if (is_nd_btt(dev))
-		return nvdimm_namespace_attach_btt(ndns);
+		return nvdimm_namespace_attach_btt(dev, ndns);
 	else if (nd_btt_probe(dev, ndns) == 0) {
 		/* we'll come back as btt-blk */
 		return -ENXIO;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 3e9f45aec8d1..a1e213e8ef81 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1668,9 +1668,12 @@  static void btt_fini(struct btt *btt)
 	}
 }
 
-int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns)
+int nvdimm_namespace_attach_btt(struct device *dev,
+		struct nd_namespace_common *ndns)
 {
+	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
 	struct nd_btt *nd_btt = to_nd_btt(ndns->claim);
+	struct resource *res = &nsio->res;
 	struct nd_region *nd_region;
 	struct btt_sb *btt_sb;
 	struct btt *btt;
@@ -1685,6 +1688,14 @@  int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns)
 	if (!btt_sb)
 		return -ENOMEM;
 
+	/*
+	 * Remove the old mapping and do the full mapping.
+	 */
+	devm_memunmap(dev, nsio->addr);
+	nsio->addr = devm_memremap(dev, res->start, resource_size(res),
+			ARCH_MEMREMAP_PMEM);
+	if (IS_ERR(nsio->addr))
+		return -ENXIO;
 	/*
 	 * If this returns < 0, that is ok as it just means there wasn't
 	 * an existing BTT, and we're creating a new one. We still need to
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 2985ca949912..9f2e6646fcd4 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -318,7 +318,7 @@  int devm_nsio_enable(struct device *dev, struct nd_namespace_io *nsio)
 	nvdimm_badblocks_populate(to_nd_region(ndns->dev.parent), &nsio->bb,
 			&nsio->res);
 
-	nsio->addr = devm_memremap(dev, res->start, resource_size(res),
+	nsio->addr = devm_memremap(dev, res->start, info_block_reserve(),
 			ARCH_MEMREMAP_PMEM);
 
 	return PTR_ERR_OR_ZERO(nsio->addr);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index ee5c04070ef9..f51d51aa2f96 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -363,7 +363,7 @@  struct resource *nvdimm_allocate_dpa(struct nvdimm_drvdata *ndd,
 resource_size_t nvdimm_namespace_capacity(struct nd_namespace_common *ndns);
 bool nvdimm_namespace_locked(struct nd_namespace_common *ndns);
 struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev);
-int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns);
+int nvdimm_namespace_attach_btt(struct device *dev, struct nd_namespace_common *ndns);
 int nvdimm_namespace_detach_btt(struct nd_btt *nd_btt);
 const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
 		char *name);
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
index acb19517f678..f4856c87d01c 100644
--- a/drivers/nvdimm/pfn.h
+++ b/drivers/nvdimm/pfn.h
@@ -36,4 +36,10 @@  struct nd_pfn_sb {
 	__le64 checksum;
 };
 
+static inline u32 info_block_reserve(void)
+{
+	return ALIGN(SZ_8K, PAGE_SIZE);
+}
+
+
 #endif /* __NVDIMM_PFN_H */
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 60d81fae06ee..e49aa9a0fd04 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -635,11 +635,6 @@  int nd_pfn_probe(struct device *dev, struct nd_namespace_common *ndns)
 }
 EXPORT_SYMBOL(nd_pfn_probe);
 
-static u32 info_block_reserve(void)
-{
-	return ALIGN(SZ_8K, PAGE_SIZE);
-}
-
 /*
  * We hotplug memory at sub-section granularity, pad the reserved area
  * from the previous section base to the namespace base address.
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f9f76f6ba07b..69956e49ec56 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -501,7 +501,7 @@  static int nd_pmem_probe(struct device *dev)
 		return -ENXIO;
 
 	if (is_nd_btt(dev))
-		return nvdimm_namespace_attach_btt(ndns);
+		return nvdimm_namespace_attach_btt(dev, ndns);
 
 	if (is_nd_pfn(dev))
 		return pmem_attach_disk(dev, ndns);