mm/sparsemem: introduce struct mem_section_usage
author     Dan Williams <dan.j.williams@intel.com>
           Thu, 18 Jul 2019 22:57:57 +0000 (15:57 -0700)
committer  Linus Torvalds <torvalds@linux-foundation.org>
           Fri, 19 Jul 2019 00:08:07 +0000 (17:08 -0700)
Patch series "mm: Sub-section memory hotplug support", v10.

The memory hotplug section is an arbitrary / convenient unit for memory
hotplug.  'Section-size' units have bled into the user interface
('memblock' sysfs) and cannot be changed without breaking existing
userspace.  The section-size constraint, while mostly benign for typical
memory hotplug, has wreaked, and continues to wreak, havoc with
'device-memory' use cases, persistent memory (pmem) in particular.
Recall that pmem
uses devm_memremap_pages(), and subsequently arch_add_memory(), to
allocate a 'struct page' memmap for pmem.  However, it does not use the
'bottom half' of memory hotplug, i.e.  never marks pmem pages online and
never exposes the userspace memblock interface for pmem.  This leaves an
opening to redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory().  Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes the physical memory alignment of pmem from one boot to the
next.  Device failure (intermittent or permanent) and physical
reconfiguration are events that can cause the platform firmware to
change the physical placement of pmem on a subsequent boot, and device
failure is an everyday event in a data-center.

It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug, and with a bit more
infrastructure, sub-section arch_add_memory() support can be added for
kernel-internal usages like devm_memremap_pages().  Here is an analysis
of the design assumptions in the current code and how they are addressed
in the new implementation:

Current design assumptions:

 - Sections that describe boot memory (early sections) are never
   unplugged / removed.

 - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
   valid_section() check.

 - __add_pages() and helper routines assume all operations occur in
   PAGES_PER_SECTION units.

 - The memblock sysfs interface only comprehends full sections.

New design assumptions:

 - Sections are instrumented with a sub-section bitmask to track (on
   x86) individual 2MB sub-divisions of a 128MB section.

 - Partially populated early sections can be extended with additional
   sub-sections, and those sub-sections can be removed with
   arch_remove_memory(). With this in place we no longer lose usable
   memory capacity to padding.

 - pfn_valid() is updated to look deeper than valid_section() to also
   check the active-sub-section mask (see the sketch following this
   list). This indication is in the same cacheline as the
   valid_section() data, so the performance impact is expected to be
   negligible. So far the lkp robot has not reported any regressions.

 - Outside of the core vmemmap population routines which are replaced,
   other helper routines like shrink_{zone,pgdat}_span() are updated to
   handle the smaller granularity. Core memory hotplug routines that
   deal with online memory are not touched.

 - The existing memblock sysfs user api guarantees / assumptions are not
   touched since this capability is limited to !online
   !memblock-sysfs-accessible sections.
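
As a rough sketch of the pfn_valid() direction (the helpers shown here,
e.g. pfn_section_valid(), early_section() and the subsection index math,
arrive in later patches of this series, so treat the names as
illustrative rather than as part of this patch):

	/* Illustrative sketch only; the real helpers land later in the series. */
	static inline int pfn_section_valid(struct mem_section *ms,
			unsigned long pfn)
	{
		/* offset within the section, in units of subsections */
		int idx = (pfn & ~PAGE_SECTION_MASK) / PAGES_PER_SUBSECTION;

		return test_bit(idx, ms->usage->subsection_map);
	}

	static inline int pfn_valid(unsigned long pfn)
	{
		struct mem_section *ms;

		if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
			return 0;
		ms = __nr_to_section(pfn_to_section_nr(pfn));
		if (!valid_section(ms))
			return 0;
		/* early (boot) sections remain valid for the full span */
		return early_section(ms) || pfn_section_valid(ms, pfn);
	}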

Meanwhile the issue reports continue to roll in from users who do not
understand when and how the 128MB constraint will bite them.  The current
implementation relies on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt.  Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem ranges
with other pmem ranges by default [3].  In short, devm_memremap_pages()
has pushed the venerable section-size constraint past the breaking point,
and the simplicity of section-aligned arch_add_memory() is no longer
tenable.

These patches are exposed to the kbuild robot on a subsection-v10 branch
[4], and a preview of the unit test for this functionality is available
on the 'subsection-pending' branch of ndctl [5].

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: https://github.com/pmem/ndctl/issues/76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

This patch (of 13):

Towards enabling memory hotplug to track partial population of a section,
introduce 'struct mem_section_usage'.

A pointer to a 'struct mem_section_usage' instance replaces the existing
pointer to a 'pageblock_flags' bitmap.  Effectively it adds one more
'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
a new 'subsection_map' bitmap.  The new bitmap enables the memory
hot{plug,remove} implementation to act on incremental sub-divisions of a
section.
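
Concretely, a single allocation of mem_section_usage_size() bytes now
carries both bitmaps, with the fixed-size 'subsection_map' followed by
the variable-length pageblock usemap.  A minimal sketch of that layout,
mirroring the hunks below:

	/*
	 * Layout of one mem_section_usage_size() allocation:
	 *
	 *   [ subsection_map:  SUBSECTIONS_PER_SECTION bits ]
	 *   [ pageblock_flags: SECTION_BLOCKFLAGS_BITS bits ]  <- usemap_size()
	 */
	struct mem_section_usage *usage =
		kzalloc(mem_section_usage_size(), GFP_KERNEL);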

SUBSECTION_SHIFT is defined as a global constant instead of a
per-architecture value like SECTION_SIZE_BITS in order to allow cross-arch
compatibility of subsection users.  Specifically, a common subsection size
allows for the possibility that persistent memory namespace configurations
can be made compatible across architectures.
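
For example, assuming the x86_64 defaults of PAGE_SHIFT == 12 and
SECTION_SIZE_BITS == 27 (a 128MB section), the derived constants work
out as:

	PFN_SUBSECTION_SHIFT    = 21 - 12        = 9
	PAGES_PER_SUBSECTION    = 1UL << 9       = 512 pages  (2MB)
	SUBSECTIONS_PER_SECTION = 1UL << (27-21) = 64

so the new 'subsection_map' fits in a single unsigned long on 64-bit.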

The primary motivation for this functionality is to support platforms that
mix "System RAM" and "Persistent Memory" within a single section, or
multiple PMEM ranges with different mapping lifetimes within a single
section.  The section restriction for hotplug has caused an ongoing saga
of hacks and bugs for devm_memremap_pages() users.

Beyond the fixups to teach existing paths how to retrieve the 'usemap'
from a section, and updates to the usemap allocation path, there are no
expected behavior changes.
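
For instance, a caller that needs the pageblock bitmap for a pfn now goes
through the new accessor instead of dereferencing the section directly,
as the mm/page_alloc.c hunk below does:

	struct mem_section *ms = __pfn_to_section(pfn);
	unsigned long *usemap = section_to_usemap(ms); /* ms->usage->pageblock_flags */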

Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
include/linux/mmzone.h
mm/memory_hotplug.c
mm/page_alloc.c
mm/sparse.c

index 298d1c3e4c2e23d63ffe65ec71bacceab7d5b5af..2520336bdfd1894abf9aa2ef93847aab0cc442b0 100644 (file)
@@ -1160,6 +1160,24 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
 #define SECTION_ALIGN_UP(pfn)  (((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK)
 #define SECTION_ALIGN_DOWN(pfn)        ((pfn) & PAGE_SECTION_MASK)
 
+#define SUBSECTION_SHIFT 21
+
+#define PFN_SUBSECTION_SHIFT (SUBSECTION_SHIFT - PAGE_SHIFT)
+#define PAGES_PER_SUBSECTION (1UL << PFN_SUBSECTION_SHIFT)
+#define PAGE_SUBSECTION_MASK (~(PAGES_PER_SUBSECTION-1))
+
+#if SUBSECTION_SHIFT > SECTION_SIZE_BITS
+#error Subsection size exceeds section size
+#else
+#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))
+#endif
+
+struct mem_section_usage {
+       DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
+       /* See declaration of similar field in struct zone */
+       unsigned long pageblock_flags[0];
+};
+
 struct page;
 struct page_ext;
 struct mem_section {
@@ -1177,8 +1195,7 @@ struct mem_section {
         */
        unsigned long section_mem_map;
 
-       /* See declaration of similar field in struct zone */
-       unsigned long *pageblock_flags;
+       struct mem_section_usage *usage;
 #ifdef CONFIG_PAGE_EXTENSION
        /*
         * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
@@ -1209,6 +1226,11 @@ extern struct mem_section **mem_section;
 extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
 #endif
 
+static inline unsigned long *section_to_usemap(struct mem_section *ms)
+{
+       return ms->usage->pageblock_flags;
+}
+
 static inline struct mem_section *__nr_to_section(unsigned long nr)
 {
 #ifdef CONFIG_SPARSEMEM_EXTREME
@@ -1220,7 +1242,7 @@ static inline struct mem_section *__nr_to_section(unsigned long nr)
        return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
 }
 extern unsigned long __section_nr(struct mem_section *ms);
-extern unsigned long usemap_size(void);
+extern size_t mem_section_usage_size(void);
 
 /*
  * We use the lower bits of the mem_map pointer to store
index fafee5f13ef23ce3ba10226225d5e8f260f542b7..cf9d979a6498b5e5e68777194b44c54aa0b857dc 100644 (file)
@@ -166,9 +166,10 @@ void put_page_bootmem(struct page *page)
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
 static void register_page_bootmem_info_section(unsigned long start_pfn)
 {
-       unsigned long *usemap, mapsize, section_nr, i;
+       unsigned long mapsize, section_nr, i;
        struct mem_section *ms;
        struct page *page, *memmap;
+       struct mem_section_usage *usage;
 
        section_nr = pfn_to_section_nr(start_pfn);
        ms = __nr_to_section(section_nr);
@@ -188,10 +189,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
        for (i = 0; i < mapsize; i++, page++)
                get_page_bootmem(section_nr, page, SECTION_INFO);
 
-       usemap = ms->pageblock_flags;
-       page = virt_to_page(usemap);
+       usage = ms->usage;
+       page = virt_to_page(usage);
 
-       mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+       mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
 
        for (i = 0; i < mapsize; i++, page++)
                get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
@@ -200,9 +201,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
 #else /* CONFIG_SPARSEMEM_VMEMMAP */
 static void register_page_bootmem_info_section(unsigned long start_pfn)
 {
-       unsigned long *usemap, mapsize, section_nr, i;
+       unsigned long mapsize, section_nr, i;
        struct mem_section *ms;
        struct page *page, *memmap;
+       struct mem_section_usage *usage;
 
        section_nr = pfn_to_section_nr(start_pfn);
        ms = __nr_to_section(section_nr);
@@ -211,10 +213,10 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
 
        register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
 
-       usemap = ms->pageblock_flags;
-       page = virt_to_page(usemap);
+       usage = ms->usage;
+       page = virt_to_page(usage);
 
-       mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+       mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
 
        for (i = 0; i < mapsize; i++, page++)
                get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
index e515bfcf7f288fbea352cbafc7daf4f0f2320ca3..be78bafbfe3a4e4b7ee6d2e27b66b52824f1c476 100644 (file)
@@ -450,7 +450,7 @@ static inline unsigned long *get_pageblock_bitmap(struct page *page,
                                                        unsigned long pfn)
 {
 #ifdef CONFIG_SPARSEMEM
-       return __pfn_to_section(pfn)->pageblock_flags;
+       return section_to_usemap(__pfn_to_section(pfn));
 #else
        return page_zone(page)->pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
index b29534cea8c023b54538429be087bef993c32ff9..41bef8e1f65c192427f19265fb864727fa7992fd 100644 (file)
@@ -288,33 +288,31 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
 
 static void __meminit sparse_init_one_section(struct mem_section *ms,
                unsigned long pnum, struct page *mem_map,
-               unsigned long *pageblock_bitmap)
+               struct mem_section_usage *usage)
 {
        ms->section_mem_map &= ~SECTION_MAP_MASK;
        ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
                                                        SECTION_HAS_MEM_MAP;
-       ms->pageblock_flags = pageblock_bitmap;
+       ms->usage = usage;
 }
 
-unsigned long usemap_size(void)
+static unsigned long usemap_size(void)
 {
        return BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS) * sizeof(unsigned long);
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-static unsigned long *__kmalloc_section_usemap(void)
+size_t mem_section_usage_size(void)
 {
-       return kmalloc(usemap_size(), GFP_KERNEL);
+       return sizeof(struct mem_section_usage) + usemap_size();
 }
-#endif /* CONFIG_MEMORY_HOTPLUG */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-static unsigned long * __init
+static struct mem_section_usage * __init
 sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
                                         unsigned long size)
 {
+       struct mem_section_usage *usage;
        unsigned long goal, limit;
-       unsigned long *p;
        int nid;
        /*
         * A page may contain usemaps for other sections preventing the
@@ -330,15 +328,16 @@ sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
        limit = goal + (1UL << PA_SECTION_SHIFT);
        nid = early_pfn_to_nid(goal >> PAGE_SHIFT);
 again:
-       p = memblock_alloc_try_nid(size, SMP_CACHE_BYTES, goal, limit, nid);
-       if (!p && limit) {
+       usage = memblock_alloc_try_nid(size, SMP_CACHE_BYTES, goal, limit, nid);
+       if (!usage && limit) {
                limit = 0;
                goto again;
        }
-       return p;
+       return usage;
 }
 
-static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
+static void __init check_usemap_section_nr(int nid,
+               struct mem_section_usage *usage)
 {
        unsigned long usemap_snr, pgdat_snr;
        static unsigned long old_usemap_snr;
@@ -352,7 +351,7 @@ static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
                old_pgdat_snr = NR_MEM_SECTIONS;
        }
 
-       usemap_snr = pfn_to_section_nr(__pa(usemap) >> PAGE_SHIFT);
+       usemap_snr = pfn_to_section_nr(__pa(usage) >> PAGE_SHIFT);
        pgdat_snr = pfn_to_section_nr(__pa(pgdat) >> PAGE_SHIFT);
        if (usemap_snr == pgdat_snr)
                return;
@@ -380,14 +379,15 @@ static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
                usemap_snr, pgdat_snr, nid);
 }
 #else
-static unsigned long * __init
+static struct mem_section_usage * __init
 sparse_early_usemaps_alloc_pgdat_section(struct pglist_data *pgdat,
                                         unsigned long size)
 {
        return memblock_alloc_node(size, SMP_CACHE_BYTES, pgdat->node_id);
 }
 
-static void __init check_usemap_section_nr(int nid, unsigned long *usemap)
+static void __init check_usemap_section_nr(int nid,
+               struct mem_section_usage *usage)
 {
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -474,14 +474,13 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
                                   unsigned long pnum_end,
                                   unsigned long map_count)
 {
-       unsigned long pnum, usemap_longs, *usemap;
+       struct mem_section_usage *usage;
+       unsigned long pnum;
        struct page *map;
 
-       usemap_longs = BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS);
-       usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid),
-                                                         usemap_size() *
-                                                         map_count);
-       if (!usemap) {
+       usage = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid),
+                       mem_section_usage_size() * map_count);
+       if (!usage) {
                pr_err("%s: node[%d] usemap allocation failed", __func__, nid);
                goto failed;
        }
@@ -497,9 +496,9 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
                        pnum_begin = pnum;
                        goto failed;
                }
-               check_usemap_section_nr(nid, usemap);
-               sparse_init_one_section(__nr_to_section(pnum), pnum, map, usemap);
-               usemap += usemap_longs;
+               check_usemap_section_nr(nid, usage);
+               sparse_init_one_section(__nr_to_section(pnum), pnum, map, usage);
+               usage = (void *) usage + mem_section_usage_size();
        }
        sparse_buffer_fini();
        return;
@@ -697,9 +696,9 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
                                     struct vmem_altmap *altmap)
 {
        unsigned long section_nr = pfn_to_section_nr(start_pfn);
+       struct mem_section_usage *usage;
        struct mem_section *ms;
        struct page *memmap;
-       unsigned long *usemap;
        int ret;
 
        /*
@@ -713,8 +712,8 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
        memmap = kmalloc_section_memmap(section_nr, nid, altmap);
        if (!memmap)
                return -ENOMEM;
-       usemap = __kmalloc_section_usemap();
-       if (!usemap) {
+       usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
+       if (!usage) {
                __kfree_section_memmap(memmap, altmap);
                return -ENOMEM;
        }
@@ -733,11 +732,11 @@ int __meminit sparse_add_one_section(int nid, unsigned long start_pfn,
 
        set_section_nid(section_nr, nid);
        section_mark_present(ms);
-       sparse_init_one_section(ms, section_nr, memmap, usemap);
+       sparse_init_one_section(ms, section_nr, memmap, usage);
 
 out:
        if (ret < 0) {
-               kfree(usemap);
+               kfree(usage);
                __kfree_section_memmap(memmap, altmap);
        }
        return ret;
@@ -773,20 +772,20 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 }
 #endif
 
-static void free_section_usemap(struct page *memmap, unsigned long *usemap,
-               struct vmem_altmap *altmap)
+static void free_section_usage(struct page *memmap,
+               struct mem_section_usage *usage, struct vmem_altmap *altmap)
 {
-       struct page *usemap_page;
+       struct page *usage_page;
 
-       if (!usemap)
+       if (!usage)
                return;
 
-       usemap_page = virt_to_page(usemap);
+       usage_page = virt_to_page(usage);
        /*
         * Check to see if allocation came from hot-plug-add
         */
-       if (PageSlab(usemap_page) || PageCompound(usemap_page)) {
-               kfree(usemap);
+       if (PageSlab(usage_page) || PageCompound(usage_page)) {
+               kfree(usage);
                if (memmap)
                        __kfree_section_memmap(memmap, altmap);
                return;
@@ -805,18 +804,18 @@ void sparse_remove_one_section(struct mem_section *ms, unsigned long map_offset,
                               struct vmem_altmap *altmap)
 {
        struct page *memmap = NULL;
-       unsigned long *usemap = NULL;
+       struct mem_section_usage *usage = NULL;
 
        if (ms->section_mem_map) {
-               usemap = ms->pageblock_flags;
+               usage = ms->usage;
                memmap = sparse_decode_mem_map(ms->section_mem_map,
                                                __section_nr(ms));
                ms->section_mem_map = 0;
-               ms->pageblock_flags = NULL;
+               ms->usage = NULL;
        }
 
        clear_hwpoisoned_pages(memmap + map_offset,
                        PAGES_PER_SECTION - map_offset);
-       free_section_usemap(memmap, usemap, altmap);
+       free_section_usage(memmap, usage, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */