[2/3] x86: add a is_e820_ram() helper
diff mbox

Message ID 20150326093424.GA28217@lst.de
State New, archived
Headers show

Commit Message

Christoph Hellwig March 26, 2015, 9:34 a.m. UTC
On Thu, Mar 26, 2015 at 10:02:15AM +0100, Ingo Molnar wrote:
> This is_e820_ram() factoring out becomes really messy in patch #3.
> 
> So you left out a bunch of places making comparisons with E820_RAM, 
> notably e820_reserve_resources_late() and memblock_x86_fill() - and of 
> course those have to be left out, otherwise NVRAM might be registered 
> and used as real kernel RAM!
> 
> And this shows the weakness and confusion caused by the factoring out 
> of is_e820_ram() and then adding E820_PMEM to its definition...
> 
> I'd rather you add explicit checks to E820_PMEM (why not E820_PRAM, to 
> keep in line with the E820_RAM name?), and not lie about 
> is_e820_ram(). It should result in the exact same end result, with 
> less confusion.
> 
> I have no fundamental objections to the driver otherwise.

Does this patch (replaces patches 2 and 3) look better to you?

---
From 4a6fdc8433559d649d7bf707f72aafa4488f2d23 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Wed, 25 Mar 2015 12:24:11 +0100
Subject: x86: add support for the non-standard protected e820 type

Various recent BIOSes support NVDIMMs or ADR using a non-standard
e820 memory type, and Intel supplied reference Linux code using this
type to various vendors.

Wire this e820 table type up to export platform devices for the pmem
driver so that we can use it in Linux, and also provide a memmap=
argument to manually tag memory as protected, which can be used
if the BIOSs doesn't use the standard nonstandard interface, or
we just want to test the pmem driver with regular memory.

Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 Documentation/kernel-parameters.txt |  6 ++++
 arch/x86/Kconfig                    | 10 ++++++
 arch/x86/include/asm/setup.h        |  6 ++++
 arch/x86/include/uapi/asm/e820.h    | 10 ++++++
 arch/x86/kernel/Makefile            |  1 +
 arch/x86/kernel/e820.c              | 31 ++++++++++++----
 arch/x86/kernel/pmem.c              | 70 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/setup.c             |  2 ++
 8 files changed, 130 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/kernel/pmem.c

Comments

Ingo Molnar March 26, 2015, 10:04 a.m. UTC | #1
* Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Mar 26, 2015 at 10:02:15AM +0100, Ingo Molnar wrote:
> > This is_e820_ram() factoring out becomes really messy in patch #3.
> > 
> > So you left out a bunch of places making comparisons with E820_RAM, 
> > notably e820_reserve_resources_late() and memblock_x86_fill() - and of 
> > course those have to be left out, otherwise NVRAM might be registered 
> > and used as real kernel RAM!
> > 
> > And this shows the weakness and confusion caused by the factoring out 
> > of is_e820_ram() and then adding E820_PMEM to its definition...
> > 
> > I'd rather you add explicit checks to E820_PMEM (why not E820_PRAM, to 
> > keep in line with the E820_RAM name?), and not lie about 
> > is_e820_ram(). It should result in the exact same end result, with 
> > less confusion.
> > 
> > I have no fundamental objections to the driver otherwise.
> 
> Does this patch (replaces patches 2 and 3) look better to you?

Yeah, the code is much clearer now:

  Acked-by: Ingo Molnar <mingo@kernel.org>

What tree is this intended for? Should I pick up the x86 bits?

Thanks,

	Ingo
Christoph Hellwig March 26, 2015, 10:19 a.m. UTC | #2
On Thu, Mar 26, 2015 at 11:04:13AM +0100, Ingo Molnar wrote:
> Yeah, the code is much clearer now:
> 
>   Acked-by: Ingo Molnar <mingo@kernel.org>
> 
> What tree is this intended for? Should I pick up the x86 bits?

The x86 bits really need to go through the x86 tree.  The pmem driver
itself would normally go through the block tree, but if Jens is fine with
it sending it through the x86 tree as well would allow us to have a single
complete tree for testing.
Ingo Molnar March 26, 2015, 10:28 a.m. UTC | #3
* Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Mar 26, 2015 at 11:04:13AM +0100, Ingo Molnar wrote:
> > Yeah, the code is much clearer now:
> > 
> >   Acked-by: Ingo Molnar <mingo@kernel.org>
> > 
> > What tree is this intended for? Should I pick up the x86 bits?
> 
> The x86 bits really need to go through the x86 tree.  The pmem 
> driver itself would normally go through the block tree, but if Jens 
> is fine with it sending it through the x86 tree as well would allow 
> us to have a single complete tree for testing.

btw., there's half a dozen block drivers in arch/* platform code, so 
in theory even the block driver could be merged there - but I agree 
that it's much cleaner and more generic in drivers/block/.

Jens?

Thanks,

	Ingo
Christoph Hellwig March 26, 2015, 10:29 a.m. UTC | #4
On Thu, Mar 26, 2015 at 11:28:10AM +0100, Ingo Molnar wrote:
> btw., there's half a dozen block drivers in arch/* platform code, so 
> in theory even the block driver could be merged there - but I agree 
> that it's much cleaner and more generic in drivers/block/.

The block driver isn't really architecture specific.  With the upcoming
UEFI interface to discover persistent memory arm, arm64 and in theory ia64
will immeditately benefit from it.  I also expect it to show up on powerpc
sooner or later at least.
Boaz Harrosh March 26, 2015, 3:49 p.m. UTC | #5
On 03/26/2015 11:34 AM, Christoph Hellwig wrote:
<>

Please re-post this patch stand alone because git am on this will
Give me the wrong title and commit message

small comments ...

> From: Christoph Hellwig <hch@lst.de>
> Date: Wed, 25 Mar 2015 12:24:11 +0100
> Subject: x86: add support for the non-standard protected e820 type
> 
> Various recent BIOSes support NVDIMMs or ADR using a non-standard
> e820 memory type, and Intel supplied reference Linux code using this
> type to various vendors.
> 
> Wire this e820 table type up to export platform devices for the pmem
> driver so that we can use it in Linux, and also provide a memmap=
> argument to manually tag memory as protected, which can be used
> if the BIOSs doesn't use the standard nonstandard interface, or
> we just want to test the pmem driver with regular memory.
> 
> Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  Documentation/kernel-parameters.txt |  6 ++++
>  arch/x86/Kconfig                    | 10 ++++++
>  arch/x86/include/asm/setup.h        |  6 ++++
>  arch/x86/include/uapi/asm/e820.h    | 10 ++++++
>  arch/x86/kernel/Makefile            |  1 +
>  arch/x86/kernel/e820.c              | 31 ++++++++++++----
>  arch/x86/kernel/pmem.c              | 70 +++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/setup.c             |  2 ++
>  8 files changed, 130 insertions(+), 6 deletions(-)
>  create mode 100644 arch/x86/kernel/pmem.c
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index bfcb1a6..c87122d 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1965,6 +1965,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			         or
>  			         memmap=0x10000$0x18690000
>  
> +	memmap=nn[KMG]!ss[KMG]
> +			[KNL,X86] Mark specific memory as protected.
> +			Region of memory to be used, from ss to ss+nn.
> +			The memory region may be marked as e820 type 12 (0xc)
> +			and is NVDIMM or ADR memory.
> +

Do we need to escape "\!" this character on grub command line ? It might
help to note that. I did like the original "|" BTW

>  	memory_corruption_check=0/1 [X86]
>  			Some BIOSes seem to corrupt the first 64k of
>  			memory when doing things like suspend/resume.
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b7d31ca..c0e8ee3 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1430,6 +1430,16 @@ config ILLEGAL_POINTER_VALUE
>  
>  source "mm/Kconfig"
>  
> +config X86_PMEM_LEGACY
> +	bool "Support non-standard NVDIMMs and ADR protected memory"
> +	help
> +	  Treat memory marked using the non-standard e820 type of 12 as used
> +	  by the Intel Sandy Bridge-EP reference BIOS as protected memory.
> +	  The kernel will offer these regions to the pmem driver so
> +	  they can be used for persistent storage.
> +
> +	  Say Y if unsure.
> +
>  config HIGHPTE
>  	bool "Allocate 3rd-level pagetables from highmem"
>  	depends on HIGHMEM
> diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
> index ff4e7b2..2352fde 100644
> --- a/arch/x86/include/asm/setup.h
> +++ b/arch/x86/include/asm/setup.h
> @@ -57,6 +57,12 @@ extern void x86_ce4100_early_setup(void);
>  static inline void x86_ce4100_early_setup(void) { }
>  #endif
>  
> +#ifdef CONFIG_X86_PMEM_LEGACY
> +void reserve_pmem(void);
> +#else
> +static inline void reserve_pmem(void) { }
> +#endif
> +
>  #ifndef _SETUP
>  
>  #include <asm/espfix.h>
> diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
> index d993e33..ce0d0bf 100644
> --- a/arch/x86/include/uapi/asm/e820.h
> +++ b/arch/x86/include/uapi/asm/e820.h
> @@ -33,6 +33,16 @@
>  #define E820_NVS	4
>  #define E820_UNUSABLE	5
>  
> +/*
> + * This is a non-standardized way to represent ADR or NVDIMM regions that
> + * persist over a reboot.  The kernel will ignore their special capabilities
> + * unless the CONFIG_X86_PMEM_LEGACY option is set.
> + *
> + * Note that older platforms also used 6 for the same type of memory,
> + * but newer versions switched to 12 as 6 was assigned differently.  Some
> + * time they will learn..
> + */
> +#define E820_PRAM	12

Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig
to enable is _PMEM_, the driver stack that gets loaded is pmem,
so PRAM is unexpected.

Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM,
but there are other not RAM technologies that can be supported exactly
the same way.
MEM is a more general name meaning "on the memory bus". I think.

I would love the consistency.

>  
>  /*
>   * reserved RAM used by kernel itself
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index cdb1b70..971f18c 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -94,6 +94,7 @@ obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
>  obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
>  obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
>  obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
> +obj-$(CONFIG_X86_PMEM_LEGACY)	+= pmem.o
>  
>  obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
>  
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 46201de..4bd525a 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -149,6 +149,9 @@ static void __init e820_print_type(u32 type)
>  	case E820_UNUSABLE:
>  		printk(KERN_CONT "unusable");
>  		break;
> +	case E820_PRAM:
> +		printk(KERN_CONT "persistent (type %u)", type);

This case can only mean 12 in this patch. (I think historically
you had a Kconfig to set its Number

> +		break;
>  	default:
>  		printk(KERN_CONT "type %u", type);
>  		break;
> @@ -688,8 +691,15 @@ void __init e820_mark_nosave_regions(unsigned long limit_pfn)
>  			register_nosave_region(pfn, PFN_UP(ei->addr));
>  
>  		pfn = PFN_DOWN(ei->addr + ei->size);
> -		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
> +
> +		switch (ei->type) {
> +		case E820_RAM:
> +		case E820_PRAM:
> +		case E820_RESERVED_KERN:
> +			break;
> +		default:
>  			register_nosave_region(PFN_UP(ei->addr), pfn);
> +		}
>  
>  		if (pfn >= limit_pfn)
>  			break;
> @@ -748,7 +758,7 @@ u64 __init early_reserve_e820(u64 size, u64 align)
>  /*
>   * Find the highest page frame number we have available
>   */
> -static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
> +static unsigned long __init e820_end_pfn(unsigned long limit_pfn)
>  {
>  	int i;
>  	unsigned long last_pfn = 0;
> @@ -759,7 +769,11 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
>  		unsigned long start_pfn;
>  		unsigned long end_pfn;
>  
> -		if (ei->type != type)
> +		/*
> +		 * Persistent memory is accounted as ram for purposes of
> +		 * establishing max_pfn and mem_map.
> +		 */
> +		if (ei->type != E820_RAM && ei->type != E820_PRAM)
>  			continue;
>  
>  		start_pfn = ei->addr >> PAGE_SHIFT;
> @@ -784,12 +798,12 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
>  }
>  unsigned long __init e820_end_of_ram_pfn(void)
>  {
> -	return e820_end_pfn(MAX_ARCH_PFN, E820_RAM);
> +	return e820_end_pfn(MAX_ARCH_PFN);
>  }
>  
>  unsigned long __init e820_end_of_low_ram_pfn(void)
>  {
> -	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT), E820_RAM);
> +	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT));
>  }
>  
>  static void early_panic(char *msg)
> @@ -866,6 +880,9 @@ static int __init parse_memmap_one(char *p)
>  	} else if (*p == '$') {
>  		start_at = memparse(p+1, &p);
>  		e820_add_region(start_at, mem_size, E820_RESERVED);
> +	} else if (*p == '!') {
> +		start_at = memparse(p+1, &p);
> +		e820_add_region(start_at, mem_size, E820_PRAM);
>  	} else
>  		e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
>  
> @@ -907,6 +924,7 @@ static inline const char *e820_type_to_string(int e820_type)
>  	case E820_ACPI:	return "ACPI Tables";
>  	case E820_NVS:	return "ACPI Non-volatile Storage";
>  	case E820_UNUSABLE:	return "Unusable memory";
> +	case E820_PRAM: return "Persistent RAM";

if you change E820_PRAM, then Also here

>  	default:	return "reserved";
>  	}
>  }
> @@ -941,7 +959,8 @@ void __init e820_reserve_resources(void)
>  		 * pcibios_resource_survey()
>  		 */
>  		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
> -			res->flags |= IORESOURCE_BUSY;
> +			if (e820.map[i].type != E820_PRAM)
> +				res->flags |= IORESOURCE_BUSY;
>  			insert_resource(&iomem_resource, res);
>  		}
>  		res++;
> diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
> new file mode 100644
> index 0000000..f970048
> --- /dev/null
> +++ b/arch/x86/kernel/pmem.c
> @@ -0,0 +1,70 @@
> +/*
> + * Copyright (c) 2009, Intel Corporation.
> + * Copyright (c) 2015, Christoph Hellwig.
> + */
> +#include <linux/memblock.h>
> +#include <linux/platform_device.h>
> +#include <linux/slab.h>
> +#include <asm/e820.h>
> +#include <asm/page_types.h>
> +#include <asm/setup.h>
> +
> +void __init reserve_pmem(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < e820.nr_map; i++) {
> +		struct e820entry *ei = &e820.map[i];
> +
> +		if (ei->type != E820_PRAM)
> +			continue;
> +
> +		memblock_reserve(ei->addr, ei->addr + ei->size);
> +		max_pfn_mapped = init_memory_mapping(
> +				ei->addr < 1UL << 32 ? 1UL << 32 : ei->addr,
> +				ei->addr + ei->size);
> +	}
> +}
> +
> +static __init void register_pmem_device(struct resource *res)
> +{
> +	struct platform_device *pdev;
> +	int error;
> +
> +	pdev = platform_device_alloc("pmem", PLATFORM_DEVID_AUTO);
> +	if (!pdev)
> +		return;
> +
> +	error = platform_device_add_resources(pdev, res, 1);
> +	if (error)
> +		goto out_put_pdev;
> +
> +	error = platform_device_add(pdev);
> +	if (error)
> +		goto out_put_pdev;
> +	return;
> +out_put_pdev:
> +	dev_warn(&pdev->dev, "failed to add pmem device!\n");
> +	platform_device_put(pdev);
> +}
> +
> +static __init int register_pmem_devices(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < e820.nr_map; i++) {
> +		struct e820entry *ei = &e820.map[i];
> +
> +		if (ei->type == E820_PRAM) {

See here it would be cleaner to ask for E820_PMEM in a 
register_pmem_devices member

Just my $0.017

Thanks
Boaz

> +			struct resource res = {
> +				.flags	= IORESOURCE_MEM,
> +				.start	= ei->addr,
> +				.end	= ei->addr + ei->size - 1,
> +			};
> +			register_pmem_device(&res);
> +		}
> +	}
> +
> +	return 0;
> +}
> +device_initcall(register_pmem_devices);
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 0a2421c..f2bed2b 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1158,6 +1158,8 @@ void __init setup_arch(char **cmdline_p)
>  
>  	early_acpi_boot_init();
>  
> +	reserve_pmem();
> +
>  	initmem_init();
>  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
>  
>
Dan Williams March 26, 2015, 4:02 p.m. UTC | #6
On Thu, Mar 26, 2015 at 8:49 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
> On 03/26/2015 11:34 AM, Christoph Hellwig wrote:
>> +/*
>> + * This is a non-standardized way to represent ADR or NVDIMM regions that
>> + * persist over a reboot.  The kernel will ignore their special capabilities
>> + * unless the CONFIG_X86_PMEM_LEGACY option is set.
>> + *
>> + * Note that older platforms also used 6 for the same type of memory,
>> + * but newer versions switched to 12 as 6 was assigned differently.  Some
>> + * time they will learn..
>> + */
>> +#define E820_PRAM    12
>
> Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig
> to enable is _PMEM_, the driver stack that gets loaded is pmem,
> so PRAM is unexpected.
>
> Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM,
> but there are other not RAM technologies that can be supported exactly
> the same way.
> MEM is a more general name meaning "on the memory bus". I think.
>
> I would love the consistency.

One of nice side of effects of having a "PRAM" name is that we can
later add a UEFI "PMEM" type where the distinction is thsy "PRAM" is
included in the system memory map by default and "PMEM" is analogous
to "IOMEM".  Just a thought...
Boaz Harrosh March 26, 2015, 4:07 p.m. UTC | #7
On 03/26/2015 06:02 PM, Dan Williams wrote:
> On Thu, Mar 26, 2015 at 8:49 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> On 03/26/2015 11:34 AM, Christoph Hellwig wrote:
>>> +/*
>>> + * This is a non-standardized way to represent ADR or NVDIMM regions that
>>> + * persist over a reboot.  The kernel will ignore their special capabilities
>>> + * unless the CONFIG_X86_PMEM_LEGACY option is set.
>>> + *
>>> + * Note that older platforms also used 6 for the same type of memory,
>>> + * but newer versions switched to 12 as 6 was assigned differently.  Some
>>> + * time they will learn..
>>> + */
>>> +#define E820_PRAM    12
>>
>> Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig
>> to enable is _PMEM_, the driver stack that gets loaded is pmem,
>> so PRAM is unexpected.
>>
>> Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM,
>> but there are other not RAM technologies that can be supported exactly
>> the same way.
>> MEM is a more general name meaning "on the memory bus". I think.
>>
>> I would love the consistency.
> 
> One of nice side of effects of having a "PRAM" name is that we can
> later add a UEFI "PMEM" type where the distinction is thsy "PRAM" is
> included in the system memory map by default and "PMEM" is analogous
> to "IOMEM".  Just a thought...
> 

Than lets say E820_PMEM_12, but not PRAM for sure. Also would UEFI be
E820_XXX will it not be a UEFI_PMEM ??

For me I hate RAM because it became to mean a technology, maybe then
the same name as the Kconfig PMEM_LEGACY

Thanks
Boaz
Christoph Hellwig March 26, 2015, 4:43 p.m. UTC | #8
On Thu, Mar 26, 2015 at 05:49:38PM +0200, Boaz Harrosh wrote:
> > +	memmap=nn[KMG]!ss[KMG]
> > +			[KNL,X86] Mark specific memory as protected.
> > +			Region of memory to be used, from ss to ss+nn.
> > +			The memory region may be marked as e820 type 12 (0xc)
> > +			and is NVDIMM or ADR memory.
> > +
> 
> Do we need to escape "\!" this character on grub command line ? It might
> help to note that. I did like the original "|" BTW

No need to escape it on the kvm command line, which is where I tested
this flag only so far.  If there is a strong argument for "|" I'm happy
to change it.

> > +#define E820_PRAM	12
> 
> Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig
> to enable is _PMEM_, the driver stack that gets loaded is pmem,
> so PRAM is unexpected.
> 
> Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM,
> but there are other not RAM technologies that can be supported exactly
> the same way.
> MEM is a more general name meaning "on the memory bus". I think.
> 
> I would love the consistency.

Ingo asked for the PRAM name, I don't really care either way.

> > +	case E820_PRAM:
> > +		printk(KERN_CONT "persistent (type %u)", type);
> 
> This case can only mean 12 in this patch. (I think historically
> you had a Kconfig to set its Number

Indeed.  I kept it in case people hack their kernels, but I can remove
it if people feel it's too ugly.
Elliott, Robert (Server Storage) March 26, 2015, 6:46 p.m. UTC | #9
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Thursday, March 26, 2015 11:43 AM
> To: Boaz Harrosh
> Cc: Christoph Hellwig; Ingo Molnar; linux-nvdimm@ml01.01.org; linux-
> fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org;
> ross.zwisler@linux.intel.com; axboe@kernel.dk
> Subject: Re: [PATCH 2/3] x86: add a is_e820_ram() helper
> 
> On Thu, Mar 26, 2015 at 05:49:38PM +0200, Boaz Harrosh wrote:
> > > +	memmap=nn[KMG]!ss[KMG]
> > > +			[KNL,X86] Mark specific memory as protected.
> > > +			Region of memory to be used, from ss to ss+nn.
> > > +			The memory region may be marked as e820 type 12 (0xc)
> > > +			and is NVDIMM or ADR memory.
> > > +
> >
> > Do we need to escape "\!" this character on grub command line ? It might
> > help to note that. I did like the original "|" BTW
> 
> No need to escape it on the kvm command line, which is where I tested
> this flag only so far.  If there is a strong argument for "|" I'm happy
> to change it.

I agree with Boaz that ! is a nuisance if loading pmem as a module
with modprobe from bash. 

The ! does work fine in the grub2 command line if the driver is 
built-in.  I added it to /etc/default/grub like this:
    GRUB_CMDLINE_LINUX="<other parameters> memmap=8G!18G"
and grub-mkconfig placed it in /boot/efi/EFI/centos/grub.cfg
with no issues.
Dan Williams March 26, 2015, 7:25 p.m. UTC | #10
On Thu, Mar 26, 2015 at 11:46 AM, Elliott, Robert (Server Storage)
<Elliott@hp.com> wrote:
>
>
>> -----Original Message-----
>> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
>> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
>> Sent: Thursday, March 26, 2015 11:43 AM
>> To: Boaz Harrosh
>> Cc: Christoph Hellwig; Ingo Molnar; linux-nvdimm@ml01.01.org; linux-
>> fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org;
>> ross.zwisler@linux.intel.com; axboe@kernel.dk
>> Subject: Re: [PATCH 2/3] x86: add a is_e820_ram() helper
>>
>> On Thu, Mar 26, 2015 at 05:49:38PM +0200, Boaz Harrosh wrote:
>> > > + memmap=nn[KMG]!ss[KMG]
>> > > +                 [KNL,X86] Mark specific memory as protected.
>> > > +                 Region of memory to be used, from ss to ss+nn.
>> > > +                 The memory region may be marked as e820 type 12 (0xc)
>> > > +                 and is NVDIMM or ADR memory.
>> > > +
>> >
>> > Do we need to escape "\!" this character on grub command line ? It might
>> > help to note that. I did like the original "|" BTW
>>
>> No need to escape it on the kvm command line, which is where I tested
>> this flag only so far.  If there is a strong argument for "|" I'm happy
>> to change it.
>
> I agree with Boaz that ! is a nuisance if loading pmem as a module
> with modprobe from bash.

This is a core kernel command line, not a module parameter.  I'm not
saying that it should stay "!", but modprobe will not need to deal
with it.
Ross Zwisler March 26, 2015, 8:53 p.m. UTC | #11
On Thu, 2015-03-26 at 17:43 +0100, Christoph Hellwig wrote:
> On Thu, Mar 26, 2015 at 05:49:38PM +0200, Boaz Harrosh wrote:
> > > +#define E820_PRAM	12
> > 
> > Why the PRAM Name. For one 2/3 of this patch say PMEM the Kconfig
> > to enable is _PMEM_, the driver stack that gets loaded is pmem,
> > so PRAM is unexpected.
> > 
> > Also I do believe PRAM is not the correct name. Yes NvDIMMs are RAM,
> > but there are other not RAM technologies that can be supported exactly
> > the same way.
> > MEM is a more general name meaning "on the memory bus". I think.
> > 
> > I would love the consistency.
> 
> Ingo asked for the PRAM name, I don't really care either way.

I also prefer "E820_PMEM", fwiw.
Yinghai Lu March 26, 2015, 10:59 p.m. UTC | #12
On Thu, Mar 26, 2015 at 2:34 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Mar 26, 2015 at 10:02:15AM +0100, Ingo Molnar wrote:
>> This is_e820_ram() factoring out becomes really messy in patch #3.
...
> Does this patch (replaces patches 2 and 3) look better to you?
>
> ---
> From 4a6fdc8433559d649d7bf707f72aafa4488f2d23 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Wed, 25 Mar 2015 12:24:11 +0100
> Subject: x86: add support for the non-standard protected e820 type
>
> Various recent BIOSes support NVDIMMs or ADR using a non-standard
> e820 memory type, and Intel supplied reference Linux code using this
> type to various vendors.
>
> Wire this e820 table type up to export platform devices for the pmem
> driver so that we can use it in Linux, and also provide a memmap=
> argument to manually tag memory as protected, which can be used
> if the BIOSs doesn't use the standard nonstandard interface, or
> we just want to test the pmem driver with regular memory.
>
> Based on an earlier patch from Dave Jiang <dave.jiang@intel.com>
>
...
> diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
> new file mode 100644
> index 0000000..f970048
> --- /dev/null
> +++ b/arch/x86/kernel/pmem.c
...
> +
> +void __init reserve_pmem(void)
> +{
> +       int i;
> +
> +       for (i = 0; i < e820.nr_map; i++) {
> +               struct e820entry *ei = &e820.map[i];
> +
> +               if (ei->type != E820_PRAM)
> +                       continue;
> +
> +               memblock_reserve(ei->addr, ei->addr + ei->size);
> +               max_pfn_mapped = init_memory_mapping(
> +                               ei->addr < 1UL << 32 ? 1UL << 32 : ei->addr,
> +                               ei->addr + ei->size);
> +       }
> +}

What do you want to get here?

You did not modify memblock_x86_fill() to treat
E820_PRAM as E820_RAM, so memblock will not have any
entry for E820_PRAM, so you do not need to call memblock_reserve
there.

And the same time,  init_memory_mapping() will call
init_range_memory_mapping/for_each_mem_pfn_range() to
set kernel mapping for memory range in memblock only.
So here calling init_memory_mapping will not do anything.
then just drop calling to that init_memory_mapping.
--- so will not kernel mapping pmem, is that what you intended to have?

After those two changes, You do not need this reserve_pmem at all.
Just drop it.


> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 0a2421c..f2bed2b 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1158,6 +1158,8 @@ void __init setup_arch(char **cmdline_p)
>
>         early_acpi_boot_init();
>
> +       reserve_pmem();
> +

Not needed.

>         initmem_init();
>         dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);

Thanks

Yinghai
Christoph Hellwig March 27, 2015, 8:10 a.m. UTC | #13
On Thu, Mar 26, 2015 at 03:59:28PM -0700, Yinghai Lu wrote:
> What do you want to get here?
> 
> You did not modify memblock_x86_fill() to treat
> E820_PRAM as E820_RAM, so memblock will not have any
> entry for E820_PRAM, so you do not need to call memblock_reserve
> there.
> 
> And the same time,  init_memory_mapping() will call
> init_range_memory_mapping/for_each_mem_pfn_range() to
> set kernel mapping for memory range in memblock only.
> So here calling init_memory_mapping will not do anything.
> then just drop calling to that init_memory_mapping.
> --- so will not kernel mapping pmem, is that what you intended to have?

I think the intent of the old Intel code was to indeed map the pmem
into KVA space.  That got broken when I forward ported it to use
memblocks.  However the current pmem infrastructure doesn't need the
KVA mapping, so I can remove it for now.

However we have heated discussions about how to do I/O to pmem, and
KVA mapping is one of the options.  If we got with that option I might
bring this code back in a fixed up version.

Patch
diff mbox

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a6..c87122d 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1965,6 +1965,12 @@  bytes respectively. Such letter suffixes can also be entirely omitted.
 			         or
 			         memmap=0x10000$0x18690000
 
+	memmap=nn[KMG]!ss[KMG]
+			[KNL,X86] Mark specific memory as protected.
+			Region of memory to be used, from ss to ss+nn.
+			The memory region may be marked as e820 type 12 (0xc)
+			and is NVDIMM or ADR memory.
+
 	memory_corruption_check=0/1 [X86]
 			Some BIOSes seem to corrupt the first 64k of
 			memory when doing things like suspend/resume.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..c0e8ee3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1430,6 +1430,16 @@  config ILLEGAL_POINTER_VALUE
 
 source "mm/Kconfig"
 
+config X86_PMEM_LEGACY
+	bool "Support non-standard NVDIMMs and ADR protected memory"
+	help
+	  Treat memory marked using the non-standard e820 type of 12 as used
+	  by the Intel Sandy Bridge-EP reference BIOS as protected memory.
+	  The kernel will offer these regions to the pmem driver so
+	  they can be used for persistent storage.
+
+	  Say Y if unsure.
+
 config HIGHPTE
 	bool "Allocate 3rd-level pagetables from highmem"
 	depends on HIGHMEM
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index ff4e7b2..2352fde 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -57,6 +57,12 @@  extern void x86_ce4100_early_setup(void);
 static inline void x86_ce4100_early_setup(void) { }
 #endif
 
+#ifdef CONFIG_X86_PMEM_LEGACY
+void reserve_pmem(void);
+#else
+static inline void reserve_pmem(void) { }
+#endif
+
 #ifndef _SETUP
 
 #include <asm/espfix.h>
diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
index d993e33..ce0d0bf 100644
--- a/arch/x86/include/uapi/asm/e820.h
+++ b/arch/x86/include/uapi/asm/e820.h
@@ -33,6 +33,16 @@ 
 #define E820_NVS	4
 #define E820_UNUSABLE	5
 
+/*
+ * This is a non-standardized way to represent ADR or NVDIMM regions that
+ * persist over a reboot.  The kernel will ignore their special capabilities
+ * unless the CONFIG_X86_PMEM_LEGACY option is set.
+ *
+ * Note that older platforms also used 6 for the same type of memory,
+ * but newer versions switched to 12 as 6 was assigned differently.  Some
+ * time they will learn..
+ */
+#define E820_PRAM	12
 
 /*
  * reserved RAM used by kernel itself
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cdb1b70..971f18c 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -94,6 +94,7 @@  obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
+obj-$(CONFIG_X86_PMEM_LEGACY)	+= pmem.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
 
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 46201de..4bd525a 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -149,6 +149,9 @@  static void __init e820_print_type(u32 type)
 	case E820_UNUSABLE:
 		printk(KERN_CONT "unusable");
 		break;
+	case E820_PRAM:
+		printk(KERN_CONT "persistent (type %u)", type);
+		break;
 	default:
 		printk(KERN_CONT "type %u", type);
 		break;
@@ -688,8 +691,15 @@  void __init e820_mark_nosave_regions(unsigned long limit_pfn)
 			register_nosave_region(pfn, PFN_UP(ei->addr));
 
 		pfn = PFN_DOWN(ei->addr + ei->size);
-		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
+
+		switch (ei->type) {
+		case E820_RAM:
+		case E820_PRAM:
+		case E820_RESERVED_KERN:
+			break;
+		default:
 			register_nosave_region(PFN_UP(ei->addr), pfn);
+		}
 
 		if (pfn >= limit_pfn)
 			break;
@@ -748,7 +758,7 @@  u64 __init early_reserve_e820(u64 size, u64 align)
 /*
  * Find the highest page frame number we have available
  */
-static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
+static unsigned long __init e820_end_pfn(unsigned long limit_pfn)
 {
 	int i;
 	unsigned long last_pfn = 0;
@@ -759,7 +769,11 @@  static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
 		unsigned long start_pfn;
 		unsigned long end_pfn;
 
-		if (ei->type != type)
+		/*
+		 * Persistent memory is accounted as ram for purposes of
+		 * establishing max_pfn and mem_map.
+		 */
+		if (ei->type != E820_RAM && ei->type != E820_PRAM)
 			continue;
 
 		start_pfn = ei->addr >> PAGE_SHIFT;
@@ -784,12 +798,12 @@  static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
 }
 unsigned long __init e820_end_of_ram_pfn(void)
 {
-	return e820_end_pfn(MAX_ARCH_PFN, E820_RAM);
+	return e820_end_pfn(MAX_ARCH_PFN);
 }
 
 unsigned long __init e820_end_of_low_ram_pfn(void)
 {
-	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT), E820_RAM);
+	return e820_end_pfn(1UL<<(32 - PAGE_SHIFT));
 }
 
 static void early_panic(char *msg)
@@ -866,6 +880,9 @@  static int __init parse_memmap_one(char *p)
 	} else if (*p == '$') {
 		start_at = memparse(p+1, &p);
 		e820_add_region(start_at, mem_size, E820_RESERVED);
+	} else if (*p == '!') {
+		start_at = memparse(p+1, &p);
+		e820_add_region(start_at, mem_size, E820_PRAM);
 	} else
 		e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
 
@@ -907,6 +924,7 @@  static inline const char *e820_type_to_string(int e820_type)
 	case E820_ACPI:	return "ACPI Tables";
 	case E820_NVS:	return "ACPI Non-volatile Storage";
 	case E820_UNUSABLE:	return "Unusable memory";
+	case E820_PRAM: return "Persistent RAM";
 	default:	return "reserved";
 	}
 }
@@ -941,7 +959,8 @@  void __init e820_reserve_resources(void)
 		 * pcibios_resource_survey()
 		 */
 		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
-			res->flags |= IORESOURCE_BUSY;
+			if (e820.map[i].type != E820_PRAM)
+				res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
 		res++;
diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
new file mode 100644
index 0000000..f970048
--- /dev/null
+++ b/arch/x86/kernel/pmem.c
@@ -0,0 +1,70 @@ 
+/*
+ * Copyright (c) 2009, Intel Corporation.
+ * Copyright (c) 2015, Christoph Hellwig.
+ */
+#include <linux/memblock.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <asm/e820.h>
+#include <asm/page_types.h>
+#include <asm/setup.h>
+
+void __init reserve_pmem(void)
+{
+	int i;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type != E820_PRAM)
+			continue;
+
+		memblock_reserve(ei->addr, ei->addr + ei->size);
+		max_pfn_mapped = init_memory_mapping(
+				ei->addr < 1UL << 32 ? 1UL << 32 : ei->addr,
+				ei->addr + ei->size);
+	}
+}
+
+static __init void register_pmem_device(struct resource *res)
+{
+	struct platform_device *pdev;
+	int error;
+
+	pdev = platform_device_alloc("pmem", PLATFORM_DEVID_AUTO);
+	if (!pdev)
+		return;
+
+	error = platform_device_add_resources(pdev, res, 1);
+	if (error)
+		goto out_put_pdev;
+
+	error = platform_device_add(pdev);
+	if (error)
+		goto out_put_pdev;
+	return;
+out_put_pdev:
+	dev_warn(&pdev->dev, "failed to add pmem device!\n");
+	platform_device_put(pdev);
+}
+
+static __init int register_pmem_devices(void)
+{
+	int i;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+
+		if (ei->type == E820_PRAM) {
+			struct resource res = {
+				.flags	= IORESOURCE_MEM,
+				.start	= ei->addr,
+				.end	= ei->addr + ei->size - 1,
+			};
+			register_pmem_device(&res);
+		}
+	}
+
+	return 0;
+}
+device_initcall(register_pmem_devices);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 0a2421c..f2bed2b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1158,6 +1158,8 @@  void __init setup_arch(char **cmdline_p)
 
 	early_acpi_boot_init();
 
+	reserve_pmem();
+
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);