Huaitong Han [Tue, 12 Jan 2016 08:04:20 +0000 (16:04 +0800)]
kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL
vmx_cpuid_tries to update SECONDARY_VM_EXEC_CONTROL in the VMCS, but
it will cause a vmwrite error on older CPUs because the code does not
check for the presence of CPU_BASED_ACTIVATE_SECONDARY_CONTROLS.
This will get rid of the following trace on e.g. Core2 6600:
vmwrite error: reg 401e value 10 (err 12)
Call Trace:
[<
ffffffff8116e2b9>] dump_stack+0x40/0x57
[<
ffffffffa020b88d>] vmx_cpuid_update+0x5d/0x150 [kvm_intel]
[<
ffffffffa01d8fdc>] kvm_vcpu_ioctl_set_cpuid2+0x4c/0x70 [kvm]
[<
ffffffffa01b8363>] kvm_arch_vcpu_ioctl+0x903/0xfa0 [kvm]
Fixes: feda805fe7c4ed9cf78158e73b1218752e3b4314
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Wed, 23 Dec 2015 13:54:00 +0000 (16:54 +0300)]
kvm/x86: Hyper-V SynIC timers tracepoints
Trace the following Hyper SynIC timers events:
* periodic timer start
* one-shot timer start
* timer callback
* timer expiration and message delivery result
* timer config setup
* timer count setup
* timer cleanup
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Wed, 23 Dec 2015 13:53:59 +0000 (16:53 +0300)]
kvm/x86: Hyper-V SynIC tracepoints
Trace the following Hyper SynIC events:
* set msr
* set sint irq
* ack sint
* sint irq eoi
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 28 Dec 2015 15:27:24 +0000 (18:27 +0300)]
kvm/x86: Update SynIC timers on guest entry only
Consolidate updating the Hyper-V SynIC timers in a
single place: on guest entry in processing KVM_REQ_HV_STIMER
request. This simplifies the overall logic, and makes sure
the most current state of msrs and guest clock is used for
arming the timers (to achieve that, KVM_REQ_HV_STIMER
has to be processed after KVM_REQ_CLOCK_UPDATE).
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 28 Dec 2015 15:27:23 +0000 (18:27 +0300)]
kvm/x86: Skip SynIC vector check for QEMU side
QEMU zero-inits Hyper-V SynIC vectors. We should allow that,
and don't reject zero values if set by the host.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 28 Dec 2015 15:27:22 +0000 (18:27 +0300)]
kvm/x86: Hyper-V fix SynIC timer disabling condition
Hypervisor Function Specification(HFS) doesn't require
to disable SynIC timer at timer config write if timer->count = 0.
So drop this check, this allow to load timers MSR's
during migration restore, because config are set before count
in QEMU side.
Also fix condition according to HFS doc(15.3.1):
"It is not permitted to set the SINTx field to zero for an
enabled timer. If attempted, the timer will be
marked disabled (that is, bit 0 cleared) immediately."
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 28 Dec 2015 15:27:21 +0000 (18:27 +0300)]
kvm/x86: Reorg stimer_expiration() to better control timer restart
Split stimer_expiration() into two parts - timer expiration message
sending and timer restart/cleanup based on timer state(config).
This also fixes a bug where a one-shot timer message whose delivery
failed once would get lost for good.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 28 Dec 2015 15:27:20 +0000 (18:27 +0300)]
kvm/x86: Hyper-V unify stimer_start() and stimer_restart()
This will be used in future to start Hyper-V SynIC timer
in several places by one logic in one function.
Changes v2:
* drop stimer->count == 0 check inside stimer_start()
* comment stimer_start() assumptions
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 28 Dec 2015 15:27:19 +0000 (18:27 +0300)]
kvm/x86: Drop stimer_stop() function
The function stimer_stop() is called in one place
so remove the function and replace it's call by function
content.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 28 Dec 2015 15:27:18 +0000 (18:27 +0300)]
kvm/x86: Hyper-V timers fix incorrect logical operation
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 7 Jan 2016 14:05:10 +0000 (15:05 +0100)]
KVM: move architecture-dependent requests to arch/
Since the numbers now overlap, it makes sense to enumerate
them in asm/kvm_host.h rather than linux/kvm_host.h. Functions
that refer to architecture-specific requests are also moved
to arch/.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 7 Jan 2016 14:02:44 +0000 (15:02 +0100)]
KVM: renumber vcpu->request bits
Leave room for 4 more arch-independent requests.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 7 Jan 2016 14:00:53 +0000 (15:00 +0100)]
KVM: document which architecture uses each request bit
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 7 Jan 2016 13:53:46 +0000 (14:53 +0100)]
KVM: Remove unused KVM_REQ_KICK to save a bit in vcpu->requests
Suggested-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
[Takuya moved all subsequent constants to fill the void, but that
is useless in view of the following patches. So this change looks
nothing like the original. - Paolo]
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 8 Jan 2016 18:03:03 +0000 (19:03 +0100)]
Merge tag 'kvm-s390-next-4.5-3' of git://git./linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Feature and fix for 4.5
- allow the runtime instrumentation support inside the guests
- remove a useless memory barrier
Nicholas Krause [Wed, 30 Dec 2015 18:08:46 +0000 (13:08 -0500)]
kvm: x86: Check kvm_write_guest return value in kvm_write_wall_clock
This makes sure the wall clock is updated only after an odd version value
is successfully written to guest memory.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fan Zhang [Thu, 7 Jan 2016 10:24:29 +0000 (18:24 +0800)]
KVM: s390: implement the RI support of guest
This patch adds runtime instrumentation support for KVM guest. We need to
setup a save area for the runtime instrumentation-controls control block(RICCB)
and implement the necessary interfaces to live migrate the guest settings.
We setup the sie control block in a way, that the runtime
instrumentation instructions of a guest are handled by hardware.
We also add a capability KVM_CAP_S390_RI to make this feature opt-in as
it needs migration support.
Signed-off-by: Fan Zhang <zhangfan@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Michael S. Tsirkin [Tue, 5 Jan 2016 16:20:24 +0000 (18:20 +0200)]
kvm/s390: drop unpaired smp_mb
smp_mb on vcpu destroy isn't paired with anything, violating pairing
rules, and seems to be useless.
Drop it.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <
1452010811-25486-1-git-send-email-mst@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
David Matlack [Wed, 30 Dec 2015 16:26:17 +0000 (08:26 -0800)]
kvm: x86: fix comment about {mmu,nested_mmu}.gva_to_gpa
The comment had the meaning of mmu.gva_to_gpa and nested_mmu.gva_to_gpa
swapped. Fix that, and also add some details describing how each translation
works.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 7 Jan 2016 10:00:57 +0000 (11:00 +0100)]
Merge tag 'kvm-arm-for-4.5-1' of git://git./linux/kernel/git/kvmarm/kvmarm into kvm-next
KVM/ARM changes for Linux v4.5
- Complete rewrite of the arm64 world switch in C, hopefully
paving the way for more sharing with the 32bit code, better
maintainability and easier integration of new features.
Also smaller and slightly faster in some cases...
- Support for 16bit VM identifiers
- Various cleanups
Takuya Yoshikawa [Fri, 18 Dec 2015 09:54:49 +0000 (18:54 +0900)]
KVM: x86: MMU: Use clear_page() instead of init_shadow_page_table()
Not just in order to clean up the code, but to make it faster by using
enhanced instructions: the initialization became 20-30% faster on our
testing machine.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pavel Fedin [Fri, 18 Dec 2015 11:38:43 +0000 (14:38 +0300)]
arm/arm64: KVM: Detect vGIC presence at runtime
Before commit
662d9715840aef44dcb573b0f9fab9e8319c868a
("arm/arm64: KVM: Kill CONFIG_KVM_ARM_{VGIC,TIMER}") is was possible to
compile the kernel without vGIC and vTimer support. Commit message says
about possibility to detect vGIC support in runtime, but this has never
been implemented.
This patch introduces runtime check, restoring the lost functionality.
It again allows to use KVM on hardware without vGIC. Interrupt
controller has to be emulated in userspace in this case.
-ENODEV return code from probe function means there's no GIC at all.
-ENXIO happens when, for example, there is GIC node in the device tree,
but it does not specify vGIC resources. Any other error code is still
treated as full stop because it might mean some really serious problems.
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Fengguang Wu [Fri, 18 Dec 2015 07:51:44 +0000 (15:51 +0800)]
MAINTAINERS: add git URL for KVM/ARM
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Vladimir Murzin [Mon, 16 Nov 2015 11:28:18 +0000 (11:28 +0000)]
arm64: KVM: Add support for 16-bit VMID
The ARMv8.1 architecture extension allows to choose between 8-bit and
16-bit of VMID, so use this capability for KVM.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Vladimir Murzin [Mon, 16 Nov 2015 11:28:17 +0000 (11:28 +0000)]
arm: KVM: Make kvm_arm.h friendly to assembly code
kvm_arm.h is included from both C code and assembly code; however some
definitions in this header supplied with U/UL/ULL suffixes which might
confuse assembly once they got evaluated.
We have _AC macro for such cases, so just wrap problem places with it.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Vladimir Murzin [Mon, 16 Nov 2015 11:28:16 +0000 (11:28 +0000)]
arm/arm64: KVM: Remove unreferenced S2_PGD_ORDER
Since commit
a987370 ("arm64: KVM: Fix stage-2 PGD allocation to have
per-page refcounting") there is no reference to S2_PGD_ORDER, so kill it
for the good.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Wed, 16 Dec 2015 15:41:12 +0000 (15:41 +0000)]
arm64: KVM: debug: Remove spurious inline attributes
The debug trapping code is pretty heavy on the "inline" attribute,
but most functions are actually referenced in the sysreg tables,
making the inlining imposible.
Removing the useless inline qualifier seems the right thing to do,
having verified that the output code is similar.
Cc: Alex Bennée <alex.bennee@linaro.org>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Mon, 14 Dec 2015 17:58:33 +0000 (17:58 +0000)]
ARM: KVM: Cleanup exception injection
David Binderman reported that the exception injection code had a
couple of unused variables lingering around.
Upon examination, it looked like this code could do with an
anticipated spring cleaning, which amounts to deduplicating
the CPSR/SPSR update, and making it look a bit more like
the architecture spec.
The spurious variables are removed in the process.
Reported-by: David Binderman <dcb314@hotmail.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Andrey Smetanin [Mon, 14 Dec 2015 15:33:05 +0000 (18:33 +0300)]
kvm/x86: Remove Hyper-V SynIC timer stopping
It's possible that guest send us Hyper-V EOM at the middle
of Hyper-V SynIC timer running, so we start processing of Hyper-V
SynIC timers in vcpu context and stop the Hyper-V SynIC timer
unconditionally:
host guest
------------------------------------------------------------------------------
start periodic stimer
start periodic timer
timer expires after 15ms
send expiration message into guest
restart periodic timer
timer expires again after 15 ms
msg slot is still not cleared so
setup ->msg_pending
(1) restart periodic timer
process timer msg and clear slot
->msg_pending was set:
send EOM into host
received EOM
kvm_make_request(KVM_REQ_HV_STIMER)
kvm_hv_process_stimers():
...
stimer_stop()
if (time_now >= stimer->exp_time)
stimer_expiration(stimer);
Because the timer was rearmed at (1), time_now < stimer->exp_time
and stimer_expiration is not called. The timer then never fires.
The patch fixes such situation by not stopping Hyper-V SynIC timer
at all, because it's safe to restart it without stop in vcpu context
and timer callback always returns HRTIMER_NORESTART.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Borislav Petkov [Fri, 20 Nov 2015 18:52:12 +0000 (19:52 +0100)]
kvm: Dump guest rIP when the guest tried something unsupported
It looks like this in action:
kvm [5197]: vcpu0, guest rIP: 0xffffffff810187ba unhandled rdmsr: 0xc001102
and helps to pinpoint quickly where in the guest we did the unsupported
thing.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 3 Dec 2015 14:56:55 +0000 (15:56 +0100)]
KVM: vmx: detect mismatched size in VMCS read/write
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
I am sending this as RFC because the error messages it produces are
very ugly. Because of inlining, the original line is lost. The
alternative is to change vmcs_read/write/checkXX into macros, but
then you need to have a single huge BUILD_BUG_ON or BUILD_BUG_ON_MSG
because multiple BUILD_BUG_ON* with the same __LINE__ are not
supported well.
Paolo Bonzini [Thu, 3 Dec 2015 14:51:00 +0000 (15:51 +0100)]
KVM: VMX: fix read/write sizes of VMCS fields in dump_vmcs
This was not printing the high parts of several 64-bit fields on
32-bit kernels. Separate from the previous one to make the patches
easier to review.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 3 Dec 2015 14:49:56 +0000 (15:49 +0100)]
KVM: VMX: fix read/write sizes of VMCS fields
In theory this should have broken EPT on 32-bit kernels (due to
reading the high part of natural-width field GUEST_CR3). Not sure
if no one noticed or the processor behaves differently from the
documentation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Li RongQing [Thu, 3 Dec 2015 05:29:34 +0000 (13:29 +0800)]
KVM: VMX: fix the writing POSTED_INTR_NV
POSTED_INTR_NV is 16bit, should not use 64bit write function
[ 5311.676074] vmwrite error: reg 3 value 0 (err 12)
[ 5311.680001] CPU: 49 PID: 4240 Comm: qemu-system-i38 Tainted: G I 4.1.13-WR8.0.0.0_standard #1
[ 5311.689343] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.
021120151325 02/11/2015
[ 5311.699550]
00000000 00000000 e69a7e1c c1950de1 00000000 e69a7e38 fafcff45 fafebd24
[ 5311.706924]
00000003 00000000 0000000c b6a06dfa e69a7e40 fafcff79 e69a7eb0 fafd5f57
[ 5311.714296]
e69a7ec0 c1080600 00000000 00000001 c0e18018 000001be 00000000 00000b43
[ 5311.721651] Call Trace:
[ 5311.722942] [<
c1950de1>] dump_stack+0x4b/0x75
[ 5311.726467] [<
fafcff45>] vmwrite_error+0x35/0x40 [kvm_intel]
[ 5311.731444] [<
fafcff79>] vmcs_writel+0x29/0x30 [kvm_intel]
[ 5311.736228] [<
fafd5f57>] vmx_create_vcpu+0x337/0xb90 [kvm_intel]
[ 5311.741600] [<
c1080600>] ? dequeue_task_fair+0x2e0/0xf60
[ 5311.746197] [<
faf3b9ca>] kvm_arch_vcpu_create+0x3a/0x70 [kvm]
[ 5311.751278] [<
faf29e9d>] kvm_vm_ioctl+0x14d/0x640 [kvm]
[ 5311.755771] [<
c1129d44>] ? free_pages_prepare+0x1a4/0x2d0
[ 5311.760455] [<
c13e2842>] ? debug_smp_processor_id+0x12/0x20
[ 5311.765333] [<
c10793be>] ? sched_move_task+0xbe/0x170
[ 5311.769621] [<
c11752b3>] ? kmem_cache_free+0x213/0x230
[ 5311.774016] [<
faf29d50>] ? kvm_set_memory_region+0x60/0x60 [kvm]
[ 5311.779379] [<
c1199fa2>] do_vfs_ioctl+0x2e2/0x500
[ 5311.783285] [<
c11752b3>] ? kmem_cache_free+0x213/0x230
[ 5311.787677] [<
c104dc73>] ? __mmdrop+0x63/0xd0
[ 5311.791196] [<
c104dc73>] ? __mmdrop+0x63/0xd0
[ 5311.794712] [<
c104dc73>] ? __mmdrop+0x63/0xd0
[ 5311.798234] [<
c11a2ed7>] ? __fget+0x57/0x90
[ 5311.801559] [<
c11a2f72>] ? __fget_light+0x22/0x50
[ 5311.805464] [<
c119a240>] SyS_ioctl+0x80/0x90
[ 5311.808885] [<
c1957d30>] sysenter_do_call+0x12/0x12
[ 5312.059280] kvm: zapping shadow pages for mmio generation wraparound
[ 5313.678415] kvm [4231]: vcpu0 disabled perfctr wrmsr: 0xc2 data 0xffff
[ 5313.726518] kvm [4231]: vcpu0 unhandled rdmsr: 0x570
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Cc: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 30 Nov 2015 16:22:21 +0000 (19:22 +0300)]
kvm/x86: Hyper-V SynIC timers
Per Hyper-V specification (and as required by Hyper-V-aware guests),
SynIC provides 4 per-vCPU timers. Each timer is programmed via a pair
of MSRs, and signals expiration by delivering a special format message
to the configured SynIC message slot and triggering the corresponding
synthetic interrupt.
Note: as implemented by this patch, all periodic timers are "lazy"
(i.e. if the vCPU wasn't scheduled for more than the timer period the
timer events are lost), regardless of the corresponding configuration
MSR. If deemed necessary, the "catch up" mode (the timer period is
shortened until the timer catches up) will be implemented later.
Changes v2:
* Use remainder to calculate periodic timer expiration time
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 30 Nov 2015 16:22:20 +0000 (19:22 +0300)]
kvm/x86: Hyper-V SynIC message slot pending clearing at SINT ack
The SynIC message protocol mandates that the message slot is claimed
by atomically setting message type to something other than HVMSG_NONE.
If another message is to be delivered while the slot is still busy,
message pending flag is asserted to indicate to the guest that the
hypervisor wants to be notified when the slot is released.
To make sure the protocol works regardless of where the message
sources are (kernel or userspace), clear the pending flag on SINT ACK
notification, and let the message sources compete for the slot again.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 30 Nov 2015 16:22:19 +0000 (19:22 +0300)]
kvm/x86: Hyper-V internal helper to read MSR HV_X64_MSR_TIME_REF_COUNT
This helper will be used also in Hyper-V SynIC timers implementation.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 30 Nov 2015 16:22:18 +0000 (19:22 +0300)]
kvm/x86: Added Hyper-V vcpu_to_hv_vcpu()/hv_vcpu_to_vcpu() helpers
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 30 Nov 2015 16:22:17 +0000 (19:22 +0300)]
kvm/x86: Rearrange func's declarations inside Hyper-V header
This rearrangement places functions declarations together
according to their functionality, so future additions
will be simplier.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 30 Nov 2015 16:22:16 +0000 (19:22 +0300)]
drivers/hv: Move struct hv_timer_message_payload into UAPI Hyper-V x86 header
This struct is required for Hyper-V SynIC timers implementation inside KVM
and for upcoming Hyper-V VMBus support by userspace(QEMU). So place it into
Hyper-V UAPI header.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 30 Nov 2015 16:22:15 +0000 (19:22 +0300)]
drivers/hv: Move struct hv_message into UAPI Hyper-V x86 header
This struct is required for Hyper-V SynIC timers implementation inside KVM
and for upcoming Hyper-V VMBus support by userspace(QEMU). So place it into
Hyper-V UAPI header.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 30 Nov 2015 16:22:14 +0000 (19:22 +0300)]
drivers/hv: Move HV_SYNIC_STIMER_COUNT into Hyper-V UAPI x86 header
This constant is required for Hyper-V SynIC timers MSR's
support by userspace(QEMU).
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smetanin [Mon, 30 Nov 2015 16:22:13 +0000 (19:22 +0300)]
drivers/hv: replace enum hv_message_type by u32
enum hv_message_type inside struct hv_message, hv_post_message
is not size portable. Replace enum by u32.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 16 Dec 2015 17:49:29 +0000 (18:49 +0100)]
Merge tag 'kvm-s390-next-4.5-2' of git://git./linux/kernel/git/kvms390/linux into HEAD
KVM: s390 features and fixes for 4.5 (kvm/next)
Some small cleanups
- use assignment instead of memcpy
- use %pK for kernel pointers
Changes regarding guest memory size
- Fix an off-by-one error in our guest memory interface (we might
use unnecessarily big page tables, e.g. 3 levels for a 2GB guest
instead of 2 levels)
- We now ask the machine about the max. supported guest address
and limit accordingly.
Guenther Hutzl [Mon, 1 Dec 2014 16:24:42 +0000 (17:24 +0100)]
KVM: s390: consider system MHA for guest storage
Verify that the guest maximum storage address is below the MHA (maximum
host address) value allowed on the host.
Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Guenther Hutzl <hutzl@linux.vnet.ibm.com>
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
[adopt to match recent limit,size changes]
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Dominik Dingel [Mon, 1 Dec 2014 16:24:42 +0000 (17:24 +0100)]
KVM: s390: fix mismatch between user and in-kernel guest limit
While the userspace interface requests the maximum size the gmap code
expects to get a maximum address.
This error resulted in bigger page tables than necessary for some guest
sizes, e.g. a 2GB guest used 3 levels instead of 2.
At the same time we introduce KVM_S390_NO_MEM_LIMIT, which allows in a
bright future that a guest spans the complete 64 bit address space.
We also switch to TASK_MAX_SIZE for the initial memory size, this is a
cosmetic change as the previous size also resulted in a 4 level pagetable
creation.
Reported-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Christian Borntraeger [Tue, 8 Dec 2015 15:55:27 +0000 (16:55 +0100)]
KVM: s390: obey kptr_restrict in traces
The s390dbf and trace events provide a debugfs interface.
If kptr_restrict is active, we should not expose kernel
pointers. We can fence the debugfs output by using %pK
instead of %p.
Cc: Kees Cook <keescook@chromium.org>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Christian Borntraeger [Wed, 2 Dec 2015 13:27:03 +0000 (14:27 +0100)]
KVM: s390: use assignment instead of memcpy
Replace two memcpy with proper assignment.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Marc Zyngier [Mon, 26 Oct 2015 09:10:07 +0000 (09:10 +0000)]
arm64: KVM: Remove weak attributes
As we've now switched to the new world switch implementation,
remove the weak attributes, as nobody is supposed to override
it anymore.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Sun, 25 Oct 2015 20:03:08 +0000 (20:03 +0000)]
arm64: KVM: Cleanup asm-offset.c
As we've now rewritten most of our code-base in C, most of the
KVM-specific code in asm-offset.c is useless. Delete-time again!
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Sun, 25 Oct 2015 19:57:11 +0000 (19:57 +0000)]
arm64: KVM: Turn system register numbers to an enum
Having the system register numbers as #defines has been a pain
since day one, as the ordering is pretty fragile, and moving
things around leads to renumbering and epic conflict resolutions.
Now that we're mostly acessing the sysreg file in C, an enum is
a much better type to use, and we can clean things up a bit.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Sun, 25 Oct 2015 15:51:41 +0000 (15:51 +0000)]
arm64: KVM: Move away from the assembly version of the world switch
This is it. We remove all of the code that has now been rewritten.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Tue, 27 Oct 2015 12:18:48 +0000 (12:18 +0000)]
arm64: KVM: Map the kernel RO section into HYP
In order to run C code in HYP, we must make sure that the kernel's
RO section is mapped into HYP (otherwise things break badly).
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Sun, 25 Oct 2015 13:58:00 +0000 (13:58 +0000)]
arm64: KVM: Add compatibility aliases
So far, we've implemented the new world switch with a completely
different namespace, so that we could have both implementation
compiled in.
Let's take things one step further by adding weak aliases that
have the same names as the original implementation. The weak
attributes allows the new implementation to be overriden by the
old one, and everything still work.
At a later point, we'll be able to simply drop the old code, and
everything will hopefully keep working, thanks to the aliases we
have just added. This also saves us repainting all the callers.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Sun, 25 Oct 2015 15:21:52 +0000 (15:21 +0000)]
arm64: KVM: Add panic handling
Add the panic handler, together with the small bits of assembly
code to call the kernel's panic implementation.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Sun, 25 Oct 2015 08:01:56 +0000 (08:01 +0000)]
arm64: KVM: HYP mode entry points
Add the entry points for HYP mode (both for hypercalls and
exception handling).
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Fri, 23 Oct 2015 07:26:37 +0000 (08:26 +0100)]
arm64: KVM: Implement TLB handling
Implement the TLB handling as a direct translation of the assembly
code version.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Mon, 26 Oct 2015 08:34:09 +0000 (08:34 +0000)]
arm64: KVM: Implement fpsimd save/restore
Implement the fpsimd save restore, keeping the lazy part in
assembler (as returning to C would be overkill).
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Wed, 21 Oct 2015 08:57:10 +0000 (09:57 +0100)]
arm64: KVM: Implement the core world switch
Implement the core of the world switch in C. Not everything is there
yet, and there is nothing to re-enter the world switch either.
But this already outlines the code structure well enough.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Wed, 28 Oct 2015 08:45:37 +0000 (08:45 +0000)]
arm64: KVM: Add patchable function selector
KVM so far relies on code patching, and is likely to use it more
in the future. The main issue is that our alternative system works
at the instruction level, while we'd like to have alternatives at
the function level.
In order to cope with this, add the "hyp_alternate_select" macro that
outputs a brief sequence of code that in turn can be patched, allowing
an alternative function to be selected.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Thu, 22 Oct 2015 07:32:18 +0000 (08:32 +0100)]
arm64: KVM: Implement guest entry
Contrary to the previous patch, the guest entry is fairly different
from its assembly counterpart, mostly because it is only concerned
with saving/restoring the GP registers, and nothing else.
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Mon, 19 Oct 2015 20:02:46 +0000 (21:02 +0100)]
arm64: KVM: Implement debug save/restore
Implement the debug save restore as a direct translation of
the assembly code version.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Mon, 19 Oct 2015 18:28:29 +0000 (19:28 +0100)]
arm64: KVM: Implement 32bit system register save/restore
Implement the 32bit system register save/restore as a direct
translation of the assembly code version.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Mon, 19 Oct 2015 17:02:48 +0000 (18:02 +0100)]
arm64: KVM: Implement system register save/restore
Implement the system register save/restore as a direct translation of
the assembly code version.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Marc Zyngier [Mon, 19 Oct 2015 15:32:20 +0000 (16:32 +0100)]
arm64: KVM: Implement timer save/restore
Implement the timer save restore as a direct translation of
the assembly code version.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Mon, 19 Oct 2015 14:50:58 +0000 (15:50 +0100)]
arm64: KVM: Implement vgic-v3 save/restore
Implement the vgic-v3 save restore as a direct translation of
the assembly code version.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Tue, 1 Dec 2015 13:48:56 +0000 (13:48 +0000)]
KVM: arm/arm64: vgic-v3: Make the LR indexing macro public
We store GICv3 LRs in reverse order so that the CPU can save/restore
them in rever order as well (don't ask why, the design is crazy),
and yet generate memory traffic that doesn't completely suck.
We need this macro to be available to the C version of save/restore.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Mon, 19 Oct 2015 14:50:37 +0000 (15:50 +0100)]
arm64: KVM: Implement vgic-v2 save/restore
Implement the vgic-v2 save restore (mostly) as a direct translation
of the assembly code version.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Marc Zyngier [Wed, 21 Oct 2015 09:09:49 +0000 (10:09 +0100)]
arm64: KVM: Add a HYP-specific header file
In order to expose the various EL2 services that are private to
the hypervisor, add a new hyp.h file.
So far, it only contains mundane things such as section annotation
and VA manipulation.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Mark Rutland [Thu, 5 Nov 2015 15:09:17 +0000 (15:09 +0000)]
arm64: Add macros to read/write system registers
Rather than crafting custom macros for reading/writing each system
register provide generics accessors, read_sysreg and write_sysreg, for
this purpose.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Suzuki Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Amit Tomar [Thu, 26 Nov 2015 10:09:43 +0000 (10:09 +0000)]
KVM: arm/arm64: Count guest exit due to various reasons
It would add guest exit statistics to debugfs, this can be helpful
while measuring KVM performance.
[ Renamed some of the field names - Christoffer ]
Signed-off-by: Amit Singh Tomar <amittomer25@gmail.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Jisheng Zhang [Thu, 12 Nov 2015 11:59:14 +0000 (19:59 +0800)]
KVM: arm/arm64: vgic: make vgic_io_ops static
vgic_io_ops is only referenced within vgic.c, so it can be declared
static.
Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Linus Torvalds [Mon, 14 Dec 2015 01:42:58 +0000 (17:42 -0800)]
Linux 4.4-rc5
Peter Zijlstra [Sun, 13 Dec 2015 21:11:16 +0000 (22:11 +0100)]
sched/wait: Fix the signal handling fix
Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/
His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().
We cannot use current->state as was done previously, because the
instruction after the store to that variable it can be changed. We must
instead pass the initial state along and use that.
Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 13 Dec 2015 20:46:04 +0000 (12:46 -0800)]
Merge tag 'nfs-for-4.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfix from Trond Myklebust:
"SUNRPC: Fix a NFSv4.1 callback channel regression"
* tag 'nfs-for-4.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Fix callback channel
Linus Torvalds [Sun, 13 Dec 2015 20:41:10 +0000 (12:41 -0800)]
Merge branch 'irq-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fixlets from Thomas Gleixner:
"Two trivial fixes which add missing header fileas and forward
declarations so the code will compile even when the magic include
chains are different"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3: Add missing include for barrier.h
irqchip/gic-v3: Add missing struct device_node declaration
Linus Torvalds [Sun, 13 Dec 2015 20:36:23 +0000 (12:36 -0800)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single fix to unbreak a clocksource driver which has more than 32bit
counter width"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: Mmio: remove artificial 32bit limitation
Linus Torvalds [Sun, 13 Dec 2015 20:29:22 +0000 (12:29 -0800)]
Merge tag 'char-misc-4.4-rc5' of git://git./linux/kernel/git/gregkh/char-misc
Pull fpga driver fixes from Greg KH:
"Only two small fpga driver fixes here, both have been in linux-next
for a while, and resolve some reported issues"
* tag 'char-misc-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
fpga manager: Fix firmware resource leak on error
fpga manager: remove label
Linus Torvalds [Sun, 13 Dec 2015 20:24:39 +0000 (12:24 -0800)]
Merge tag 'staging-4.4-rc5' of git://git./linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Here are a few staging and IIO driver fixes for 4.4-rc5.
All of them resolve reported problems and have been in linux-next for
a while. Nothing major here, just small fixes where needed"
* tag 'staging-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: lustre: echo_copy.._lsm() dereferences userland pointers directly
iio: adc: spmi-vadc: add missing of_node_put
iio: fix some warning messages
iio: light: apds9960: correct ->last_busy count
iio: lidar: return -EINVAL on invalid signal
staging: iio: dummy: complete IIO events delivery to userspace
Linus Torvalds [Sun, 13 Dec 2015 19:58:18 +0000 (11:58 -0800)]
Merge tag 'usb-4.4-rc5' of git://git./linux/kernel/git/gregkh/usb
Pull USB driver fixes from Greg KH:
"Here are a number of small USB fixes for 4.4-rc5. All of them have
been in linux-next. The majority are gadget and phy issues, with a
few new quirks and device ids added as well"
* tag 'usb-4.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (32 commits)
USB: add quirk for devices with broken LPM
xhci: fix usb2 resume timing and races.
usb: musb: fail with error when no DMA controller set
usb: gadget: uvc: fix permissions of configfs attributes
usb: musb: core: Fix pm runtime for deferred probe
usb: phy: msm: fix a possible NULL dereference
USB: host: ohci-at91: fix a crash in ohci_hcd_at91_overcurrent_irq
usb: Quiet down false peer failure messages
usb: xhci: fix config fail of FS hub behind a HS hub with MTT
xhci: Fix memory leak in xhci_pme_acpi_rtd3_enable()
usb: Use the USB_SS_MULT() macro to decode burst multiplier for log message
USB: whci-hcd: add check for dma mapping error
usb: core : hub: Fix BOS 'NULL pointer' kernel panic
USB: quirks: Apply ALWAYS_POLL to all ELAN devices
usb-storage: Fix scsi-sd failure "Invalid field in cdb" for USB adapter JMicron
USB: quirks: Fix another ELAN touchscreen
usb: dwc3: gadget: don't prestart interrupt endpoints
USB: serial: Another Infineon flash loader USB ID
USB: cdc_acm: Ignore Infineon Flash Loader utility
USB: cp210x: Remove CP2110 ID from compatibility list
...
Linus Torvalds [Sun, 13 Dec 2015 00:43:44 +0000 (16:43 -0800)]
Merge tag 'fixes-for-linus' of git://git./linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"Here are a bunch of small bug fixes for various ARM platforms, nothing
really sticks out this week, most of either fixes bugs in code that
was just added in 4.4, or that has been broken for many years without
anyone noticing.
at91/sama5d2:
- fix sama5de hardware setup of sd/mmc interface
- proper selection of pinctrl drivers. PIO4 is necessary for sama5d2
berlin:
- fix incorrect clock input for SDIO
exynos:
- Fix potential NULL pointer dereference in Exynos PMU driver.
imx:
- Fix vf610 SAI clock configuration bug which is discovered by the
newly added master mode support in SAI audio driver.
- Fix buggy L2 cache latency values in vf610 device trees, which may
cause system hang when cpu runs at a higher frequency.
ixp4xx:
- fix prototypes for readl/writel functions
ls2080a:
- use little-endian register access for GPIO and SDHCI
omap:
- Fix clock source for ARM TWD and global timers on am437x
- Always select REGULATOR_FIXED_VOLTAGE for omap2+ instead of when
MACH_OMAP3_PANDORA is selected
- Fix SPI DMA handles for dm816x as only some were mapped
- Fix up mbox cells for dm816x to make mailbox usable
pxa:
- use PWM lookup table for all ezx machines
s3c24xx:
- Remove incorrect __init annotation from s3c24xx cpufreq driver
structures.
versatile:
- fix PCI IRQ mapping on Versatile PB"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ls2080a/dts: Add little endian property for GPIO IP block
dt-bindings: define little-endian property for QorIQ GPIO
ARM64: dts: ls2080a: fix eSDHC endianness
ARM: dts: vf610: use reset values for L2 cache latencies
ARM: pxa: use PWM lookup table for all machines
ARM: dts: berlin: add 2nd clock for BG2Q sdhci0 and sdhci1
ARM: dts: berlin: correct BG2Q's sdhci2 2nd clock
ARM: dts: am4372: fix clock source for arm twd and global timers
ARM: at91: fix pinctrl driver selection
ARM: at91/dt: add always-on to 1.8V regulator
ARM: dts: vf610: fix clock definition for SAI2
ARM: imx: clk-vf610: fix SAI clock tree
ARM: ixp4xx: fix read{b,w,l} return types
irqchip/versatile-fpga: Fix PCI IRQ mapping on Versatile PB
ARM: OMAP2+: enable REGULATOR_FIXED_VOLTAGE
ARM: dts: add dm816x missing spi DT dma handles
ARM: dts: add dm816x missing #mbox-cells
cpufreq: s3c24xx: Do not mark s3c2410_plls_add as __init
ARM: EXYNOS: Fix potential NULL pointer access in exynos_sys_powerdown_conf
Linus Torvalds [Sat, 12 Dec 2015 21:39:59 +0000 (13:39 -0800)]
Merge tag 'powerpc-4.4-4' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- opal-irqchip: Fix double endian conversion from Alistair Popple
- cxl: Set endianess of kernel contexts from Frederic Barrat
- sbc8641: drop bogus PHY IRQ entries from DTS file from Paul Gortmaker
- Revert "powerpc/eeh: Don't unfreeze PHB PE after reset" from Andrew
Donnellan
* tag 'powerpc-4.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
Revert "powerpc/eeh: Don't unfreeze PHB PE after reset"
powerpc/sbc8641: drop bogus PHY IRQ entries from DTS file
cxl: Set endianess of kernel contexts
powerpc/opal-irqchip: Fix double endian conversion
Linus Torvalds [Sat, 12 Dec 2015 18:44:49 +0000 (10:44 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"17 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MIPS: fix DMA contiguous allocation
sh64: fix __NR_fgetxattr
ocfs2: fix SGID not inherited issue
mm/oom_kill.c: avoid attempting to kill init sharing same memory
drivers/base/memory.c: prohibit offlining of memory blocks with missing sections
tmpfs: fix shmem_evict_inode() warnings on i_blocks
mm/hugetlb.c: fix resv map memory leak for placeholder entries
mm: hugetlb: call huge_pte_alloc() only if ptep is null
kernel: remove stop_machine() Kconfig dependency
mm: kmemleak: mark kmemleak_init prototype as __init
mm: fix kerneldoc on mem_cgroup_replace_page
osd fs: __r4w_get_page rely on PageUptodate for uptodate
MAINTAINERS: make Vladimir co-maintainer of the memory controller
mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress
mm: fix swapped Movable and Reclaimable in /proc/pagetypeinfo
memcg: fix memory.high target
mm: hugetlb: fix hugepage memory leak caused by wrong reserve count
Linus Torvalds [Sat, 12 Dec 2015 18:34:20 +0000 (10:34 -0800)]
Merge branch 'parisc-4.4-3' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"Fix the boot crash on Mako machines with Huge Pages, prevent a panic
with SATA controllers (and others) by correctly calculating the IOMMU
space, hook up the mlock2 syscall and drop unneeded code in the parisc
pci code"
* 'parisc-4.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Disable huge pages on Mako machines
parisc: Wire up mlock2 syscall
parisc: Remove unused pcibios_init_bus()
parisc iommu: fix panic due to trying to allocate too large region
Linus Torvalds [Sat, 12 Dec 2015 18:24:00 +0000 (10:24 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
"A set of fixes for the current series. This contains:
- A bunch of fixes for lightnvm, should be the last round for this
series. From Matias and Wenwei.
- A writeback detach inode fix from Ilya, also marked for stable.
- A block (though it says SCSI) fix for an OOPS in SCSI runtime power
management.
- Module init error path fixes for null_blk from Minfei"
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: Fix error path in module initialization
lightnvm: do not compile in debugging by default
lightnvm: prevent gennvm module unload on use
lightnvm: fix media mgr registration
lightnvm: replace req queue with nvmdev for lld
lightnvm: comments on constants
lightnvm: check mm before use
lightnvm: refactor spin_unlock in gennvm_get_blk
lightnvm: put blks when luns configure failed
lightnvm: use flags in rrpc_get_blk
block: detach bdev inode from its wb in __blkdev_put()
SCSI: Fix NULL pointer dereference in runtime PM
Linus Torvalds [Sat, 12 Dec 2015 18:16:26 +0000 (10:16 -0800)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Update the linker script to use L1_CACHE_BYTES instead of hard-coded
64. We recently changed L1_CACHE_BYTES to 128
- Improve race condition reporting on set_pte_at() and change the BUG
to WARN_ONCE. With hardware update of the accessed/dirty state, we
need to ensure that set_pte_at() does not inadvertently override
hardware updated state. The patch also makes the checks ignore
!pte_valid() new entries
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Improve error reporting on set_pte_at() checks
arm64: update linker script to increased L1_CACHE_BYTES value
Qais Yousef [Fri, 11 Dec 2015 21:41:09 +0000 (13:41 -0800)]
MIPS: fix DMA contiguous allocation
Recent changes to how GFP_ATOMIC is defined seems to have broken the
condition to use mips_alloc_from_contiguous() in
mips_dma_alloc_coherent().
I couldn't bottom out the exact change but I think it's this commit
d0164adc89f6 ("mm, page_alloc: distinguish between being unable to
sleep, unwilling to sleep and avoiding waking kswapd").
GFP_ATOMIC has multiple bits set and the check for !(gfp & GFP_ATOMIC)
isn't enough.
The reason behind this condition is to check whether we can potentially
do a sleeping memory allocation. Use gfpflags_allow_blocking() instead
which should be more robust.
Signed-off-by: Qais Yousef <qais.yousef@imgtec.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitry V. Levin [Fri, 11 Dec 2015 21:41:06 +0000 (13:41 -0800)]
sh64: fix __NR_fgetxattr
According to arch/sh/kernel/syscalls_64.S and common sense, __NR_fgetxattr
has to be defined to 259, but it doesn't. Instead, it's defined to 269,
which is of course used by another syscall, __NR_sched_setaffinity in this
case.
This bug was found by strace test suite.
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junxiao Bi [Fri, 11 Dec 2015 21:41:03 +0000 (13:41 -0800)]
ocfs2: fix SGID not inherited issue
Commit
8f1eb48758aa ("ocfs2: fix umask ignored issue") introduced an
issue, SGID of sub dir was not inherited from its parents dir. It is
because SGID is set into "inode->i_mode" in ocfs2_get_init_inode(), but
is overwritten by "mode" which don't have SGID set later.
Fixes: 8f1eb48758aa ("ocfs2: fix umask ignored issue")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chen Jie [Fri, 11 Dec 2015 21:41:00 +0000 (13:41 -0800)]
mm/oom_kill.c: avoid attempting to kill init sharing same memory
It's possible that an oom killed victim shares an ->mm with the init
process and thus oom_kill_process() would end up trying to kill init as
well.
This has been shown in practice:
Out of memory: Kill process 9134 (init) score 3 or sacrifice child
Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB
Kill process 1 (init) sharing same memory
...
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
And this will result in a kernel panic.
If a process is forked by init and selected for oom kill while still
sharing init_mm, then it's likely this system is in a recoverable state.
However, it's better not to try to kill init and allow the machine to
panic due to unkillable processes.
[rientjes@google.com: rewrote changelog]
[akpm@linux-foundation.org: fix inverted test, per Ben]
Signed-off-by: Chen Jie <chenjie6@huawei.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Seth Jennings [Fri, 11 Dec 2015 21:40:57 +0000 (13:40 -0800)]
drivers/base/memory.c: prohibit offlining of memory blocks with missing sections
Commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory
x86-64 systems") and
982792c782ef ("x86, mm: probe memory block size for
generic x86 64bit") introduced large block sizes for x86. This made it
possible to have multiple sections per memory block where previously,
there was a only every one section per block.
Since blocks consist of contiguous ranges of section, there can be holes
in the blocks where sections are not present. If one attempts to
offline such a block, a crash occurs since the code is not designed to
deal with this.
This patch is a quick fix to gaurd against the crash by not allowing
blocks with non-present sections to be offlined.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=107781
Signed-off-by: Seth Jennings <sjennings@variantweb.net>
Reported-by: Andrew Banman <abanman@sgi.com>
Cc: Daniel J Blueman <daniel@numascale.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Greg KH <greg@kroah.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 11 Dec 2015 21:40:55 +0000 (13:40 -0800)]
tmpfs: fix shmem_evict_inode() warnings on i_blocks
Dmitry Vyukov provides a little program, autogenerated by syzkaller,
which races a fault on a mapping of a sparse memfd object, against
truncation of that object below the fault address: run repeatedly for a
few minutes, it reliably generates shmem_evict_inode()'s
WARN_ON(inode->i_blocks).
(But there's nothing specific to memfd here, nor to the fstat which it
happened to use to generate the fault: though that looked suspicious,
since a shmem_recalc_inode() had been added there recently. The same
problem can be reproduced with open+unlink in place of memfd_create, and
with fstatfs in place of fstat.)
v3.7 commit
0f3c42f522dc ("tmpfs: change final i_blocks BUG to WARNING")
explains one cause of such a warning (a race with shmem_writepage to
swap), and possible solutions; but we never took it further, and this
syzkaller incident turns out to have a different cause.
shmem_getpage_gfp()'s error recovery, when a freshly allocated page is
then found to be beyond eof, looks plausible - decrementing the alloced
count that was just before incremented - but in fact can go wrong, if a
racing thread (the truncator, for example) gets its shmem_recalc_inode()
in just after our delete_from_page_cache(). delete_from_page_cache()
decrements nrpages, that shmem_recalc_inode() will balance the books by
decrementing alloced itself, then our decrement of alloced take it one
too low: leading to the WARNING when the object is finally evicted.
Once the new page has been exposed in the page cache,
shmem_getpage_gfp() must leave it to shmem_recalc_inode() itself to get
the accounting right in all cases (and not fall through from "trunc:" to
"decused:"). Adjust that error recovery block; and the reinitialization
of info and sbinfo can be removed too.
While we're here, fix shmem_writepage() to avoid the original issue: it
will be safe against a racing shmem_recalc_inode(), if it merely
increments swapped before the shmem_delete_from_page_cache() which
decrements nrpages (but it must then do its own shmem_recalc_inode()
before that, while still in balance, instead of after). (Aside: why do
we shmem_recalc_inode() here in the swap path? Because its raison d'etre
is to cope with clean sparse shmem pages being reclaimed behind our
back: so here when swapping is a good place to look for that case.) But
I've not now managed to reproduce this bug, even without the patch.
I don't see why I didn't do that earlier: perhaps inhibited by the
preference to eliminate shmem_recalc_inode() altogether. Driven by this
incident, I do now have a patch to do so at last; but still want to sit
on it for a bit, there's a couple of questions yet to be resolved.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Fri, 11 Dec 2015 21:40:52 +0000 (13:40 -0800)]
mm/hugetlb.c: fix resv map memory leak for placeholder entries
Dmitry Vyukov reported the following memory leak
unreferenced object 0xffff88002eaafd88 (size 32):
comm "a.out", pid 5063, jiffies
4295774645 (age 15.810s)
hex dump (first 32 bytes):
28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff (.Nc....(.Nc....
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
kmalloc include/linux/slab.h:458
region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
__vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
vma_needs_reservation mm/hugetlb.c:1813
alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
hugetlb_no_page mm/hugetlb.c:3543
hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
__get_user_pages+0x542/0xf30 mm/gup.c:497
populate_vma_page_range+0xde/0x110 mm/gup.c:919
__mm_populate+0x1c7/0x310 mm/gup.c:969
do_mlock+0x291/0x360 mm/mlock.c:637
SYSC_mlock2 mm/mlock.c:658
SyS_mlock2+0x4b/0x70 mm/mlock.c:648
Dmitry identified a potential memory leak in the routine region_chg,
where a region descriptor is not free'ed on an error path.
However, the root cause for the above memory leak resides in region_del.
In this specific case, a "placeholder" entry is created in region_chg.
The associated page allocation fails, and the placeholder entry is left
in the reserve map. This is "by design" as the entry should be deleted
when the map is released. The bug is in the region_del routine which is
used to delete entries within a specific range (and when the map is
released). region_del did not handle the case where a placeholder entry
exactly matched the start of the range range to be deleted. In this
case, the entry would not be deleted and leaked. The fix is to take
these special placeholder entries into account in region_del.
The region_chg error path leak is also fixed.
Fixes: feba16e25a57 ("mm/hugetlb: add region_del() to delete a specific range of entries")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org> [4.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Fri, 11 Dec 2015 21:40:49 +0000 (13:40 -0800)]
mm: hugetlb: call huge_pte_alloc() only if ptep is null
Currently at the beginning of hugetlb_fault(), we call huge_pte_offset()
and check whether the obtained *ptep is a migration/hwpoison entry or
not. And if not, then we get to call huge_pte_alloc(). This is racy
because the *ptep could turn into migration/hwpoison entry after the
huge_pte_offset() check. This race results in BUG_ON in
huge_pte_alloc().
We don't have to call huge_pte_alloc() when the huge_pte_offset()
returns non-NULL, so let's fix this bug with moving the code into else
block.
Note that the *ptep could turn into a migration/hwpoison entry after
this block, but that's not a problem because we have another
!pte_present check later (we never go into hugetlb_no_page() in that
case.)
Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org> [2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Wilson [Fri, 11 Dec 2015 21:40:46 +0000 (13:40 -0800)]
kernel: remove stop_machine() Kconfig dependency
Currently the full stop_machine() routine is only enabled on SMP if
module unloading is enabled, or if the CPUs are hotpluggable. This
leads to configurations where stop_machine() is broken as it will then
only run the callback on the local CPU with irqs disabled, and not stop
the other CPUs or run the callback on them.
For example, this breaks MTRR setup on x86 in certain configs since
ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and
text_poke_smp_batch() functions") as the MTRR is only established on the
boot CPU.
This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
and HOTPLUG_CPU config options to compile the correct stop_machine() for
the architecture, removing the false dependency on MODULE_UNLOAD in the
process.
Link: https://lkml.org/lkml/2014/10/8/124
References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Iulia Manda <iulia.manda21@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nicolas Iooss [Fri, 11 Dec 2015 21:40:43 +0000 (13:40 -0800)]
mm: kmemleak: mark kmemleak_init prototype as __init
The kmemleak_init() definition in mm/kmemleak.c is marked __init but its
prototype in include/linux/kmemleak.h is marked __ref since commit
a6186d89c913 ("kmemleak: Mark the early log buffer as __initdata").
This causes a section mismatch which is reported as a warning when
building with clang -Wsection, because kmemleak_init() is declared in
section .ref.text but defined in .init.text.
Fix this by marking kmemleak_init() prototype __init.
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 11 Dec 2015 21:40:40 +0000 (13:40 -0800)]
mm: fix kerneldoc on mem_cgroup_replace_page
Whoops, I missed removing the kerneldoc comment of the lrucare arg
removed from mem_cgroup_replace_page; but it's a good comment, keep it.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 11 Dec 2015 21:40:38 +0000 (13:40 -0800)]
osd fs: __r4w_get_page rely on PageUptodate for uptodate
Commit
42cb14b110a5 ("mm: migrate dirty page without
clear_page_dirty_for_io etc") simplified the migration of a PageDirty
pagecache page: one stat needs moving from zone to zone and that's about
all.
It's convenient and safest for it to shift the PageDirty bit from old
page to new, just before updating the zone stats: before copying data
and marking the new PageUptodate. This is all done while both pages are
isolated and locked, just as before; and just as before, there's a
moment when the new page is visible in the radix_tree, but not yet
PageUptodate. What's new is that it may now be briefly visible as
PageDirty before it is PageUptodate.
When I scoured the tree to see if this could cause a problem anywhere,
the only places I found were in two similar functions __r4w_get_page():
which look up a page with find_get_page() (not using page lock), then
claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate.
I'm not sure whether that was right before, but now it might be wrong
(on rare occasions): only claim the page is uptodate if PageUptodate.
Or perhaps the page in question could never be migratable anyway?
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Boaz Harrosh <ooo@electrozaur.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 11 Dec 2015 21:40:35 +0000 (13:40 -0800)]
MAINTAINERS: make Vladimir co-maintainer of the memory controller
Vladimir architected and authored much of the current state of the
memcg's slab memory accounting and tracking. Make sure he gets CC'd on
bug reports ;-)
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 11 Dec 2015 21:40:32 +0000 (13:40 -0800)]
mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress
Tetsuo Handa has reported that the system might basically livelock in
OOM condition without triggering the OOM killer.
The issue is caused by internal dependency of the direct reclaim on
vmstat counter updates (via zone_reclaimable) which are performed from
the workqueue context. If all the current workers get assigned to an
allocation request, though, they will be looping inside the allocator
trying to reclaim memory but zone_reclaimable can see stalled numbers so
it will consider a zone reclaimable even though it has been scanned way
too much. WQ concurrency logic will not consider this situation as a
congested workqueue because it relies that worker would have to sleep in
such a situation. This also means that it doesn't try to spawn new
workers or invoke the rescuer thread if the one is assigned to the
queue.
In order to fix this issue we need to do two things. First we have to
let wq concurrency code know that we are in trouble so we have to do a
short sleep. In order to prevent from issues handled by
0e093d99763e
("writeback: do not sleep on the congestion queue if there are no
congested BDIs or if significant congestion is not being encountered in
the current zone") we limit the sleep only to worker threads which are
the ones of the interest anyway.
The second thing to do is to create a dedicated workqueue for vmstat and
mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
have a spare worker thread for it.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Cristopher Lameter <clameter@sgi.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>