--- /dev/null
+ .. SPDX-License-Identifier: GPL-2.0
+
+ =============
+ Kernel Stacks
+ =============
+
+ Kernel stacks on x86-64
+ =======================
+
+ Most of the text from Keith Owens, hacked by AK
+
+ x86_64 page size (PAGE_SIZE) is 4K.
+
+ Like all other architectures, x86_64 has a kernel stack for every
+ active thread. These thread stacks are THREAD_SIZE (4*PAGE_SIZE) big.
+ These stacks contain useful data as long as a thread is alive or a
+ zombie. While the thread is in user space the kernel stack is empty.
+
+ In addition to the per thread stacks, there are specialized stacks
+ associated with each CPU. These stacks are only used while the kernel
+ is in control on that CPU; when a CPU returns to user space the
+ specialized stacks contain no useful data. The main CPU stacks are:
+
+ * Interrupt stack. IRQ_STACK_SIZE
+
+ Used for external hardware interrupts. If this is the first external
+ hardware interrupt (i.e. not a nested hardware interrupt) then the
+ kernel switches from the current task's stack to the interrupt stack. Like
+ the split thread and interrupt stacks on i386, this gives more room
+ for kernel interrupt processing without having to increase the size
+ of every per thread stack.
+
+ The interrupt stack is also used when processing a softirq.
+
+ Switching to the kernel interrupt stack is done by software based on a
+ per CPU interrupt nest counter. This is needed because x86-64 "IST"
+ hardware stacks cannot nest without races.
+
+ x86_64 also has a feature which is not available on i386, the ability
+ to automatically switch to a new stack for designated events such as
+ double fault or NMI, which makes it easier to handle these unusual
+ events on x86_64. This feature is called the Interrupt Stack Table
+ (IST). There can be up to 7 IST entries per CPU. The IST code is an
+ index into the Task State Segment (TSS). The IST entries in the TSS
+ point to dedicated stacks; each stack can be a different size.
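+
+ A hedged sketch of the hardware structure this describes, following the
+ 64-bit TSS layout in the architecture manuals (the kernel's own
+ definition is struct x86_hw_tss in arch/x86/include/asm/processor.h)::
+
+   struct hw_tss64 {
+           unsigned int       reserved0;
+           unsigned long long rsp[3];   /* stacks for privilege levels 0-2 */
+           unsigned long long reserved1;
+           unsigned long long ist[7];   /* Interrupt Stack Table: entry N-1
+                                         * is loaded into RSP when a gate
+                                         * selects IST index N (1-7) */
+           unsigned long long reserved2;
+           unsigned short     reserved3;
+           unsigned short     io_bitmap_base;
+   } __attribute__((packed));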
+
+ An IST is selected by a non-zero value in the IST field of an
+ interrupt-gate descriptor. When an interrupt occurs and the hardware
+ loads such a descriptor, the hardware automatically sets the new stack
+ pointer based on the IST value, then invokes the interrupt handler. If
+ the interrupt came from user mode, then the interrupt handler prologue
+ will switch back to the per-thread stack. If software wants to allow
+ nested IST interrupts then the handler must adjust the IST values on
+ entry to and exit from the interrupt handler. (This is occasionally
+ done, e.g. for debug exceptions.)
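+
+ For illustration, a sketch of the 16-byte 64-bit interrupt-gate
+ descriptor carrying that IST field (cf. struct gate_struct in
+ arch/x86/include/asm/desc_defs.h)::
+
+   struct idt_gate64 {
+           unsigned short offset_low;    /* handler address bits 0-15 */
+           unsigned short segment;       /* code segment selector */
+           unsigned short ist   : 3,     /* 0 = no stack switch,
+                                          * 1-7 = TSS IST entry to load */
+                          zero  : 5,
+                          type  : 5,     /* gate type */
+                          dpl   : 2,     /* descriptor privilege level */
+                          p     : 1;     /* present bit */
+           unsigned short offset_middle; /* handler address bits 16-31 */
+           unsigned int   offset_high;   /* handler address bits 32-63 */
+           unsigned int   reserved;
+   } __attribute__((packed));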
+
+ Events with different IST codes (i.e. with different stacks) can be
+ nested. For example, a debug interrupt can safely be interrupted by an
+ NMI. arch/x86/entry/entry_64.S::paranoid_entry adjusts the stack
+ pointers on entry to and exit from all IST events, in theory allowing
+ IST events with the same code to be nested. However in most cases, the
+ stack size allocated to an IST assumes no nesting for the same code.
+ If that assumption is ever broken then the stacks will become corrupt.
+
+ The currently assigned IST stacks are:
+
+* ESTACK_DF. EXCEPTION_STKSZ (PAGE_SIZE).
+
+ Used for interrupt 8 - Double Fault Exception (#DF).
+
+ Invoked when handling one exception causes another exception. Happens
+ when the kernel is very confused (e.g. kernel stack pointer corrupt).
+ Using a separate stack allows the kernel to recover from it well enough
+ in many cases to still output an oops.
+
+* ESTACK_NMI. EXCEPTION_STKSZ (PAGE_SIZE).
+
+ Used for non-maskable interrupts (NMI).
+
+ NMI can be delivered at any time, including when the kernel is in the
+ middle of switching stacks. Using IST for NMI events avoids making
+ assumptions about the previous state of the kernel stack.
+
+* ESTACK_DB. EXCEPTION_STKSZ (PAGE_SIZE).
+
+ Used for hardware debug interrupts (interrupt 1) and for software
+ debug interrupts (INT3).
+
+ When debugging a kernel, debug interrupts (both hardware and
+ software) can occur at any time. Using IST for these interrupts
+ avoids making assumptions about the previous state of the kernel
+ stack.
+
+ To handle nested #DB correctly there exist two instances of DB stacks. On
+ #DB entry the IST stack pointer for #DB is switched to the second instance
+ so a nested #DB starts from a clean stack. The nested #DB switches
+ the IST stack pointer to a guard hole to catch triple nesting.
+
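+ A hedged sketch of that shift, reusing the hw_tss64 sketch from above
+ (constants are illustrative; the real adjustment is done with an IST
+ offset in the kernel's #DB entry code)::
+
+   #define IST_DB          2        /* cf. the kernel's IST_INDEX_DB */
+   #define DB_STACK_STRIDE 0x2000UL /* one stack instance plus guard gap,
+                                     * illustrative value */
+
+   /* On #DB entry, point the IST slot at the next stack instance so a
+    * nested #DB gets a clean stack; undo it on exit.  A third level
+    * lands in the guard hole and faults loudly. */
+   void db_enter(struct hw_tss64 *tss)
+   {
+           tss->ist[IST_DB] -= DB_STACK_STRIDE;
+   }
+
+   void db_exit(struct hw_tss64 *tss)
+   {
+           tss->ist[IST_DB] += DB_STACK_STRIDE;
+   }
+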
+* ESTACK_MCE. EXCEPTION_STKSZ (PAGE_SIZE).
+
+ Used for interrupt 18 - Machine Check Exception (#MC).
+
+ MCE can be delivered at any time, including when the kernel is in the
+ middle of switching stacks. Using IST for MCE events avoids making
+ assumptions about the previous state of the kernel stack.
+
+ For more details see the Intel 64 and IA-32 or AMD64 architecture manuals.
+
+
+ Printing backtraces on x86
+ ==========================
+
+ The question about the '?' preceding function names in an x86 stacktrace
+ keeps popping up, so here's an in-depth explanation. It helps if the reader
+ stares at print_context_stack() and the whole machinery in and around
+ arch/x86/kernel/dumpstack.c.
+
+ Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>:
+
+ We always scan the full kernel stack for return addresses stored on
+ the kernel stack(s) [1]_, from stack top to stack bottom, and print out
+ anything that 'looks like' a kernel text address.
+
+ If it fits into the frame pointer chain, we print it without a question
+ mark, knowing that it's part of the real backtrace.
+
+ If the address does not fit into our expected frame pointer chain we
+ still print it, but we print a '?'. It can mean two things:
+
+ - either the address is not part of the call chain: it's just stale
+ values on the kernel stack, from earlier function calls. This is
+ the common case.
+
+ - or it is part of the call chain, but the frame pointer was not set
+ up properly within the function, so we don't recognize it.
+
+ This way we will always print out the real call chain (plus a few more
+ entries), regardless of whether the frame pointer was set up correctly
+ or not - but in most cases we'll get the call chain right as well. The
+ entries printed are strictly in stack order, so you can deduce more
+ information from that as well.
+
+ The most important property of this method is that we _never_ lose
+ information: we always strive to print _all_ addresses on the stack(s)
+ that look like kernel text addresses, so if debug information is wrong,
+ we still print out the real call chain as well - just with more question
+ marks than ideal.
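+
+ A condensed sketch of that scan (the helper names are illustrative
+ stand-ins; the real logic uses __kernel_text_address() and the frame
+ pointer walk in arch/x86/kernel/dumpstack.c)::
+
+   #include <linux/kernel.h>
+
+   extern bool looks_like_kernel_text(unsigned long addr);
+   extern bool on_frame_pointer_chain(unsigned long *sp, unsigned long addr);
+
+   void scan_stack(unsigned long *top, unsigned long *bottom)
+   {
+           unsigned long *sp;
+
+           for (sp = top; sp < bottom; sp++) {
+                   unsigned long addr = *sp;
+
+                   if (!looks_like_kernel_text(addr))
+                           continue;                 /* just data, skip */
+
+                   if (on_frame_pointer_chain(sp, addr))
+                           printk(" %pS\n", (void *)addr);   /* reliable */
+                   else
+                           printk(" ? %pS\n", (void *)addr); /* possibly stale */
+           }
+   }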
+
+ .. [1] For things like IRQ and IST stacks, we also scan those stacks, in
+ the right order, and try to cross from one stack into another
+ reconstructing the call chain. This works most of the time.
--- /dev/null
+ .. SPDX-License-Identifier: GPL-2.0
+
+ ============
+ x86 Topology
+ ============
+
+ This documents and clarifies the main aspects of x86 topology modelling and
+ representation in the kernel. Update/change it when making changes to the
+ respective code.
+
+ The architecture-agnostic topology definitions are in
+ Documentation/cputopology.txt. This file holds x86-specific
+ differences/specialities which do not necessarily apply to the generic
+ definitions. Thus, the way to read up on Linux topology on x86 is to start
+ with the generic one and look at this one in parallel for the x86 specifics.
+
+ Needless to say, code should use the generic functions - this file is *only*
+ here to *document* the inner workings of x86 topology.
+
+ Started by Thomas Gleixner <tglx@linutronix.de> and Borislav Petkov <bp@alien8.de>.
+
+ The main aim of the topology facilities is to present adequate interfaces to
+ code which needs to know/query/use the structure of the running system wrt
+ threads, cores, packages, etc.
+
+ The kernel does not care about the concept of physical sockets because a
+ socket has no relevance to software. It's an electromechanical component. In
+ the past a socket always contained a single package (see below), but with the
+ advent of Multi Chip Modules (MCM) a socket can hold more than one package. So
+ there might still be references to sockets in the code, but they are of
+ historical nature and should be cleaned up.
+
+ The topology of a system is described in the units of:
+
+ - packages
+ - cores
+ - threads
+
+ Package
+ =======
+ Packages contain a number of cores plus shared resources, e.g. DRAM
+ controller, shared caches etc.
+
+ AMD nomenclature for package is 'Node'.
+
+ Package-related topology information in the kernel:
+
+ - cpuinfo_x86.x86_max_cores:
+
+ The number of cores in a package. This information is retrieved via CPUID.
+
+ - cpuinfo_x86.phys_proc_id:
+
+ The physical ID of the package. This information is retrieved via CPUID
+ and deduced from the APIC IDs of the cores in the package.
+
+ - cpuinfo_x86.logical_proc_id:
+
+ The logical ID of the package. As we do not trust BIOSes to enumerate the
+ packages in a consistent way, we introduced the concept of logical package
+ ID so we can sanely calculate the maximum possible number of packages in
+ the system and have the packages enumerated linearly.
+
+ - topology_max_packages():
+
+ The maximum possible number of packages in the system. Helpful for per
+ package facilities to preallocate per package information.
+
+ - cpu_llc_id:
+
+ A per-CPU variable containing:
+
+ - On Intel, the first APIC ID of the list of CPUs sharing the Last Level
+ Cache
+
+ - On AMD, the Node ID or Core Complex ID containing the Last Level
+ Cache. In general, it is a number identifying an LLC uniquely on the
+ system.
+
+ Cores
+ =====
+ A core consists of 1 or more threads. It does not matter whether the threads
+ are SMT- or CMT-type threads.
+
+ AMD's nomenclature for a CMT core is "Compute Unit". The kernel always uses
+ "core".
+
+ Core-related topology information in the kernel:
+
+ - smp_num_siblings:
+
+ The number of threads in a core. The number of threads in a package can be
+ calculated by::
+
+ threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings
+
+
+ Threads
+ =======
+ A thread is a single scheduling unit. It's the equivalent of a logical Linux
+ CPU.
+
+ AMD's nomenclature for CMT threads is "Compute Unit Core". The kernel always
+ uses "thread".
+
+ Thread-related topology information in the kernel:
+
+ - topology_core_cpumask():
+
+ The cpumask contains all online threads in the package to which a thread
+ belongs.
+
+ The number of online threads is also printed in /proc/cpuinfo "siblings."
+
+ - topology_sibling_cpumask():
+
+ The cpumask contains all online threads in the core to which a thread
+ belongs.
+
+ - topology_logical_package_id():
+
+ The logical package ID to which a thread belongs.
+
+ - topology_physical_package_id():
+
+ The physical package ID to which a thread belongs.
+
+ - topology_core_id():
+
+ The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo
+ "core_id."
+
+
+
+ System topology examples
+ ========================
+
+ .. note::
+ The alternative Linux CPU enumeration depends on how the BIOS enumerates the
+ threads. Many BIOSes enumerate all threads 0 first and then all threads 1.
+ That has the "advantage" that the logical Linux CPU numbers of threads 0 stay
+ the same whether threads are enabled or not. That's merely an implementation
+ detail and has no practical impact.
+
+ 1) Single Package, Single Core::
+
+ [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+
+ 2) Single Package, Dual Core
+
+ a) One thread per core::
+
+ [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+ -> [core 1] -> [thread 0] -> Linux CPU 1
+
+ b) Two threads per core::
+
+ [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+ -> [thread 1] -> Linux CPU 1
+ -> [core 1] -> [thread 0] -> Linux CPU 2
+ -> [thread 1] -> Linux CPU 3
+
+ Alternative enumeration::
+
+ [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+ -> [thread 1] -> Linux CPU 2
+ -> [core 1] -> [thread 0] -> Linux CPU 1
+ -> [thread 1] -> Linux CPU 3
+
+ AMD nomenclature for CMT systems::
+
+ [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0
+ -> [Compute Unit Core 1] -> Linux CPU 1
+ -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2
+ -> [Compute Unit Core 1] -> Linux CPU 3
+
+ 3) Dual Package, Dual Core
+
+ a) One thread per core::
+
+ [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+ -> [core 1] -> [thread 0] -> Linux CPU 1
+
+ [package 1] -> [core 0] -> [thread 0] -> Linux CPU 2
+ -> [core 1] -> [thread 0] -> Linux CPU 3
+
+ b) Two threads per core::
+
+ [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+ -> [thread 1] -> Linux CPU 1
+ -> [core 1] -> [thread 0] -> Linux CPU 2
+ -> [thread 1] -> Linux CPU 3
+
+ [package 1] -> [core 0] -> [thread 0] -> Linux CPU 4
+ -> [thread 1] -> Linux CPU 5
+ -> [core 1] -> [thread 0] -> Linux CPU 6
+ -> [thread 1] -> Linux CPU 7
+
+ Alternative enumeration::
+
+ [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0
+ -> [thread 1] -> Linux CPU 4
+ -> [core 1] -> [thread 0] -> Linux CPU 1
+ -> [thread 1] -> Linux CPU 5
+
+ [package 1] -> [core 0] -> [thread 0] -> Linux CPU 2
+ -> [thread 1] -> Linux CPU 6
+ -> [core 1] -> [thread 0] -> Linux CPU 3
+ -> [thread 1] -> Linux CPU 7
+
+ AMD nomenclature for CMT systems::
+
+ [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0
+ -> [Compute Unit Core 1] -> Linux CPU 1
+ -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2
+ -> [Compute Unit Core 1] -> Linux CPU 3
+
+ [node 1] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 4
+ -> [Compute Unit Core 1] -> Linux CPU 5
+ -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 6
+ -> [Compute Unit Core 1] -> Linux CPU 7
--- /dev/null
- from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PT starting
+ .. SPDX-License-Identifier: GPL-2.0
+
+ =================
+ Memory Management
+ =================
+
+ Complete virtual memory map with 4-level page tables
+ ====================================================
+
+ .. note::
+
+ - Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down
+ from the top of the 64-bit address space. It's easier to understand the layout
+ when seen both in absolute addresses and in distance-from-top notation.
+
+ For example 0xffffe90000000000 == -23 TB: it's 23 TB lower than the top of the
+ 64-bit address space (ffffffffffffffff).
+
+ Note that as we get closer to the top of the address space, the notation changes
+ from TB to GB and then MB/KB.
+
+ - "16M TB" might look weird at first sight, but it's an easier to visualize size
+ notation than "16 EB", which few will recognize at first sight as 16 exabytes.
+ It also shows it nicely how incredibly large 64-bit address space is.
+
+ ::
+
+ ========================================================================================================================
+ Start addr | Offset | End addr | Size | VM area description
+ ========================================================================================================================
+ | | | |
+ 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm
+ __________________|____________|__________________|_________|___________________________________________________________
+ | | | |
+ 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -128 TB
+ | | | | starting offset of kernel mappings.
+ __________________|____________|__________________|_________|___________________________________________________________
+ |
+ | Kernel-space virtual memory, shared between all processes:
+ ____________________________________________________________|___________________________________________________________
+ | | | |
+ ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor
+ ffff880000000000 | -120 TB | ffff887fffffffff | 0.5 TB | LDT remap for PTI
+ ffff888000000000 | -119.5 TB | ffffc87fffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)
+ ffffc88000000000 | -55.5 TB | ffffc8ffffffffff | 0.5 TB | ... unused hole
+ ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base)
+ ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole
+ ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base)
+ ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole
+ ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory
+ __________________|____________|__________________|_________|____________________________________________________________
+ |
+ | Identical layout to the 56-bit one from here on:
+ ____________________________________________________________|____________________________________________________________
+ | | | |
+ fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
+ | | | | vaddr_end for KASLR
+ fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
+ ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
+ ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
+ ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
+ ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
+ ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
+ ffffffff80000000 |-2048 MB | | |
+ ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
+ ffffffffff000000 | -16 MB | | |
+ FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
+ ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
+ ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
+ __________________|____________|__________________|_________|___________________________________________________________
+
+
+ Complete virtual memory map with 5-level page tables
+ ====================================================
+
+ .. note::
+
+ - With 56-bit addresses, user-space memory gets expanded by a factor of 512x,
+ from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PB starting
+ offset and many of the regions expand to support the much larger physical
+ memory supported.
+
+ ::
+
+ ========================================================================================================================
+ Start addr | Offset | End addr | Size | VM area description
+ ========================================================================================================================
+ | | | |
+ 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm
+ __________________|____________|__________________|_________|___________________________________________________________
+ | | | |
+ 0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -64 PB
+ | | | | starting offset of kernel mappings.
+ __________________|____________|__________________|_________|___________________________________________________________
+ |
+ | Kernel-space virtual memory, shared between all processes:
+ ____________________________________________________________|___________________________________________________________
+ | | | |
+ ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor
+ ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
+ ff11000000000000 | -59.75 PB | ff90ffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base)
+ ff91000000000000 | -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
+ ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
+ ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole
+ ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base)
+ ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole
+ ffdf000000000000 | -8.25 PB | fffffbffffffffff | ~8 PB | KASAN shadow memory
+ __________________|____________|__________________|_________|____________________________________________________________
+ |
+ | Identical layout to the 47-bit one from here on:
+ ____________________________________________________________|____________________________________________________________
+ | | | |
+ fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
+ | | | | vaddr_end for KASLR
+ fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
+ ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
+ ffffff8000000000 | -512 GB | ffffffeeffffffff | 444 GB | ... unused hole
+ ffffffef00000000 | -68 GB | fffffffeffffffff | 64 GB | EFI region mapping space
+ ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | ... unused hole
+ ffffffff80000000 | -2 GB | ffffffff9fffffff | 512 MB | kernel text mapping, mapped to physical address 0
+ ffffffff80000000 |-2048 MB | | |
+ ffffffffa0000000 |-1536 MB | fffffffffeffffff | 1520 MB | module mapping space
+ ffffffffff000000 | -16 MB | | |
+ FIXADDR_START | ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
+ ffffffffff600000 | -10 MB | ffffffffff600fff | 4 kB | legacy vsyscall ABI
+ ffffffffffe00000 | -2 MB | ffffffffffffffff | 2 MB | ... unused hole
+ __________________|____________|__________________|_________|___________________________________________________________
+
+ The architecture defines a 64-bit virtual address. Implementations can support
+ less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
+ through to the most-significant implemented bit are sign extended.
+ This causes a hole between user space and kernel addresses if you interpret them
+ as unsigned.
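+
+ For illustration, a user-space sketch of that canonicality rule, where
+ va_bits is the implemented width (48 or 57)::
+
+   #include <stdbool.h>
+   #include <stdint.h>
+
+   static bool is_canonical(uint64_t vaddr, unsigned int va_bits)
+   {
+           unsigned int shift = 64 - va_bits;
+
+           /* Bits 63 .. va_bits-1 must all equal bit va_bits-1, i.e.
+            * sign-extending the low va_bits must reproduce the address.
+            * (Arithmetic right shift of a signed value is
+            * implementation-defined in ISO C but universal in practice.) */
+           return (uint64_t)((int64_t)(vaddr << shift) >> shift) == vaddr;
+   }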
+
+ The direct mapping covers all memory in the system up to the highest
+ memory address (this means in some cases it can also include PCI memory
+ holes).
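+
+ That fixed offset makes physical/virtual conversion inside the direct map
+ pure arithmetic, which is roughly what the kernel's __va()/__pa() helpers
+ do for such addresses (sketch; page_offset_base is the direct-map base
+ mentioned in the tables above)::
+
+   extern unsigned long page_offset_base;
+
+   static void *phys_to_direct_virt(unsigned long phys)
+   {
+           return (void *)(page_offset_base + phys);        /* cf. __va() */
+   }
+
+   static unsigned long direct_virt_to_phys(void *virt)
+   {
+           /* Only valid for addresses inside the direct mapping. */
+           return (unsigned long)virt - page_offset_base;   /* cf. __pa() */
+   }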
+
+ vmalloc space is lazily synchronized into the different PML4/PML5 pages of
+ the processes using the page fault handler, with init_top_pgt as
+ reference.
+
+ We map EFI runtime services in the 'efi_pgd' PGD in a 64 GB large virtual
+ memory window (this size is arbitrary, it can be raised later if needed).
+ The mappings are not part of any other kernel PGD and are only available
+ during EFI runtime calls.
+
+ Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
+ physical memory, vmalloc/ioremap space and virtual memory map are randomized.
+ Their order is preserved but their base will be offset early at boot time.
+
+ Be very careful vs. KASLR when changing anything here. The KASLR address
+ range must not overlap with anything except the KASAN shadow area, which is
+ correct as KASAN disables KASLR.
+
+ For both 4- and 5-level layouts, the STACKLEAK_POISON value lies in the last
+ 2 MB hole: ffffffffffff4111.