Matan Barak [Tue, 18 Apr 2017 09:03:41 +0000 (12:03 +0300)]
IB/core: Don't use is_async in event files to infer events size
Previously, we inferred the events size in ib_uverbs_event_read by
using the is_async flag. Instead of that, we pass the event size
directly.
Fixes: 1e7710f3f656 ('IB/core: Change completion channel to use the reworked objects schema')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 18 Apr 2017 09:03:40 +0000 (12:03 +0300)]
IB/core: A small refactor in destroy WQ handler
Instead of having uverbs_uobject_put both in the error flow and the
good flow, we unite them.
Fixes: fd3c7904db6e ('IB/core: Change idr objects to use the new schema')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 18 Apr 2017 09:03:39 +0000 (12:03 +0300)]
IB/core: Nullify ib_uobject during allocation
Currently, we initialize all fields of ib_uobject straight after
allocation. Therefore, a kmalloc was sufficient. Since ib_uobject
could be embedded in a type specific structure, we nullify it to
spare programmer errors.
Fixes: 3832125624b7 ('IB/core: Add support for idr types')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 18 Apr 2017 09:03:38 +0000 (12:03 +0300)]
IB/core: Don't pass the lock state to _rdma_remove_commit_uobject
The only scenario where this function was called while the lock is
already taken is in the context cleanup scenario. Thus, in order not
to pass the lock state to this function, we just call the remove logic
straight from the cleanup context function.
Fixes: 3832125624b7 ('IB/core: Add support for idr types')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 18 Apr 2017 09:03:37 +0000 (12:03 +0300)]
IB/core: Rename write flag to exclusive in rdma_core
We rename the "write" flags to "exclusive", as it's used for both
WRITE and DESTROY actions.
Fixes: 3832125624b7 ('IB/core: Add support for idr types')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Mike Marciniszyn [Tue, 21 Mar 2017 00:26:26 +0000 (17:26 -0700)]
IB/hfi1: Eliminate synchronize_rcu() in mr delete
The synchronize_rcu() call can be eliminated to improve memory deregistration
performance.
There are two key fields involved:
- The rcu pointer itself
- the lkey_published field
To close the window between the rcu read of the mregion pointer and the
reference count the code should:
1. To lkey/rkey validation (reader)
Read the rcu pointer. If the pointer is non-NULL, get a reference.
To the current validation tests use a READ_ONCE() on the lkey_published.
Upon any failure release the reference.
2. To the remove logic (delete)
Insure the published is zeroed prior to setting the pointer to NULL.
This requires using rcu_assign_pointer() to insure lkey_published
is written prior to the NULL.
3. To the insert logic (add)
Insure the published is set use an rcu_assign_pointer() to insure the
pointer is after all MR fields.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Don Hiatt [Tue, 21 Mar 2017 00:26:20 +0000 (17:26 -0700)]
IB/hfi1: Add transmit fault injection feature
Add ability to fault packets on transmit by opcode.
Dropping by packet can be achieved by setting the mask to 0.
In order to drop non-verbs traffic we set PbcInsertHrc
to NONE (0x2). The packet will still be delivered to
the receiving node but a KHdrHCRCErr (KDETH packet
with a bad HCRC) will be triggered and the packet will
not be delivered to the correct context.
In order to drop regular verbs traffic we set the
PbcTestEbp flag. The packet will still be delivered
to the receiving node but a 'late ebp error' will
be triggered and will be dropped.
A global toggle (/sys/kernel/debug/hfi1/hfi1_X/fault_suppress_err)
has been added to suppress the error messages on the receive
node when a packet was faulted on the sending node.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Don Hiatt [Tue, 21 Mar 2017 00:26:14 +0000 (17:26 -0700)]
IB/hfi1: Add receive fault injection feature
Add fault injection capability:
- Drop packets unconditionally (fault_by_packet)
- Drop packets based on opcode (fault_by_opcode)
This feature reacts to the global FAULT_INJECTION
config flag.
The faulting traces have been added:
- misc/fault_opcode
- misc/fault_packet
See 'Documentation/fault-injection/fault-injection.txt'
for details.
Examples:
- Dropping packets by opcode:
/sys/kernel/debug/hfi1/hfi1_X/fault_opcode
# Enable fault
echo Y > fault_by_opcode
# Setprobability of dropping (0-100%)
# echo 25 > probability
# Set opcode
echo 0x64 > opcode
# Number of times to fault
echo 3 > times
# An optional mask allows you to fault
# a range of opcodes
echo 0xf0 > mask
/sys/kernel/debug/hfi1/hfi1_X/fault_stats
contains a value in parentheses to indicate
number of each opcode dropped.
- Dropping packets unconditionally
/sys/kernel/debug/hfi1/hfi1_X/fault_packet
# Enable fault
echo Y > fault_by_packet
/sys/kernel/debug/hfi1/hfi1_X/fault_packet/fault_stats
contains the number of packets dropped.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Michael J. Ruhl [Tue, 21 Mar 2017 00:26:07 +0000 (17:26 -0700)]
IB/hfi1: Ensure VL index is within bounds
Improve the safety of the code and ensure the array cannot be indexed
out of bounds when picking the CPU for a given SDMA engine.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Mike Marciniszyn [Tue, 21 Mar 2017 00:26:01 +0000 (17:26 -0700)]
IB/rdmavt: Avoid reseting wqe send_flags in unreserve
The wqe should be read only and in fact the superfluous reset of the
RVT_SEND_RESERVE_USED flag causes an issue where reserved operations
elicit a bad completion to the ULP.
The maintenance of the flag is now entirely within rvt_post_one_wr()
where a reserved operation will set the flag and a non-reserved operation
will insure the operation that is about to be posted has the flag reset.
Fixes: Commit 856cc4c237ad ("IB/hfi1: Add the capability for reserved operations")
Reviewed-by: Don Hiatt <don.hiatt@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Sebastian Sanchez [Tue, 21 Mar 2017 00:25:55 +0000 (17:25 -0700)]
IB/rdmavt, IB/hfi1: Fix timer migration regressions
RC timeout counter isn't getting incremented.
Increment counter and add the trace for it.
Fixes: 87c23b4ab018 ("IB/rdmavt: Adding timer logic to rdmavt")
Reviewed-by: Brian Welty <brian.welty@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Michael J. Ruhl [Tue, 21 Mar 2017 00:25:48 +0000 (17:25 -0700)]
IB/hfi1: Add a patch value to the firmware version string
The HFI firmware now includes a patch level in its version.
Updating the necessary code to include the patch version in the
firmware string.
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Easwar Hariharan [Tue, 21 Mar 2017 00:25:42 +0000 (17:25 -0700)]
IB/hfi1: Check for QSFP presence before attempting reads
Attempting to read the status of a QSFP cable creates noise in the logs
and misses out on setting an appropriate Offline/Disabled Reason if the
cable is not plugged in. Check for this prior to attempting the read and
attendant retries.
Fixes: 673b975f1fba ("IB/hfi1: Add QSFP sanity pre-check")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Tadeusz Struk [Tue, 21 Mar 2017 00:25:35 +0000 (17:25 -0700)]
IB/hfi1: Protect the global dev_cntr_names and port_cntr_names
Protect the global dev_cntr_names and port_cntr_names with the global
mutex as they are allocated and freed in a function called per device.
Otherwise there is a danger of double free and memory leaks.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Tadeusz Struk [Tue, 21 Mar 2017 00:25:29 +0000 (17:25 -0700)]
IB/hfi1: Check device id early during init
If there is a wrong device passed to the driver it should fail early,
without trying to initialize the device only to find out that it has
an invalid device later during the init.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Mike Marciniszyn [Tue, 21 Mar 2017 00:25:23 +0000 (17:25 -0700)]
IB/rdmavt: Add swqe completion trace
The following fields are available for filter/trace:
- wqe
- wr_id
- qpn
- qpt
- length
- idx
- ssn
- (wr)opcode
- (wr)send_flags
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Mike Marciniszyn [Tue, 21 Mar 2017 00:25:16 +0000 (17:25 -0700)]
IB/rdmavt: Add tracing for cq entry and poll
The following fields are defined for filtering and triggering:
- wr_id
- status
- opcode
- qpn
- length
- idx
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Mike Marciniszyn [Tue, 21 Mar 2017 00:25:10 +0000 (17:25 -0700)]
IB/rdmavt: Add additional fields to post send trace
This fix is to get additional debugging information.
The following fields are added:
- wqe
- qpt
- num_sge
- ssn
- pid
- send_flags
These additional fields provide for more focused filtering
and triggering.
The patch also moves the trace to just before the wqe is
posted to get the most accurate information and future proofs
the code to trace all possible reserved opcodes.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Mike Marciniszyn [Tue, 21 Mar 2017 00:25:04 +0000 (17:25 -0700)]
IB/rdmavt, IB/hfi1, IB/qib: Make wc opcode translation driver dependent
The work to create a completion helper moved the translation of send
wqe operations to completion opcodes to rdmvat.
This precludes having driver dependent operations. Make the translation
driver dependent by doing the translation in the driver prior to the
rvt_qp_swqe_complete() call using restored translation tables.
Fixes: Commit f2dc9cdce83c ("IB/rdmavt: Add a send completion helper")
Fixes: Commit 0771da5a6e9d ("IB/hfi1,IB/qib: Use new send completion helper")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Sebastian Sanchez [Tue, 21 Mar 2017 00:24:58 +0000 (17:24 -0700)]
IB/hfi1: NULL pointer dereference when freeing rhashtable
A NULL pointer dereference occurs when the driver
is unloaded, and the SDMA rhashtable is freed if
the rhashtable_init() function has not been called.
Prevent this by changing sdma_rht to be a pointer
to a dynamically allocated hash table. The NULL-ness
of the pointer serves as an indication that the hash
table was initialized and that it needs to be
destroyed.
Fixes: 0cb2aa690c7e ("IB/hfi1: Add sysfs interface for affinity setup")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Michael J. Ruhl [Tue, 21 Mar 2017 00:24:51 +0000 (17:24 -0700)]
IB/hfi1: Cache registers during state change
When the LCB is going offline, inopportune port queries can cause
benign error messages to be logged. To deal with this, cache the
registers just before setting the LCB to offline, allowing queries to
return without eliciting the error.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Michael J. Ruhl [Tue, 21 Mar 2017 00:24:45 +0000 (17:24 -0700)]
IB/hfi1: Race hazard avoidance in user SDMA driver
Set the errcode before the state and add the smb_wmb() to avoid a
potential race condition with the user.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Michael Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Dean Luick [Tue, 21 Mar 2017 00:24:39 +0000 (17:24 -0700)]
IB/hfi1: Force logical link down
If the logical link state does not read as down when
the physical link state is offline, force it to down.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Jakub Byczkowski <jakub.byczkowski@intel.com>
Signed-off-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Shamir Rabinovitch [Wed, 29 Mar 2017 10:21:59 +0000 (06:21 -0400)]
IB/IPoIB: ibX: failed to create mcg debug file
When udev renames the netdev devices, ipoib debugfs entries does not
get renamed. As a result, if subsequent probe of ipoib device reuse the
name then creating a debugfs entry for the new device would fail.
Also, moved ipoib_create_debug_files and ipoib_delete_debug_files as part
of ipoib event handling in order to avoid any race condition between these.
Fixes: 1732b0ef3b3a ([IPoIB] add path record information in debugfs)
Cc: stable@vger.kernel.org # 2.6.15+
Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Mark Brown [Thu, 30 Mar 2017 14:56:01 +0000 (15:56 +0100)]
IB/hns: Explicitly include linux/of.h
hns_roce_hw_v1.c uses DT interfaces but relies on implict inclusion of
linux/of.h which means that changes in other headers could break the
build, as happened in -next for arm64 today. Add an explicit include.
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 4 Apr 2017 10:31:47 +0000 (13:31 +0300)]
IB/core: Change completion channel to use the reworked objects schema
This patch adds the standard fd based type - completion_channel.
The completion_channel is now prefixed with ib_uobject, similarly
to the rest of the uobjects.
This requires a few changes:
(1) We define a new completion channel fd based object type.
(2) completion_event and async_event are now two different types.
This means they use different fops.
(3) We release the completion_channel exactly as we release other
idr based objects.
(4) Since ib_uobjects are already kref-ed, we only add the kref to the
async event.
A fd object requires filling out several parameters. Its op pointer
should point to uverbs_fd_ops and its size should be at least the
size if ib_uobject. We use a macro to make the type declaration
easier.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 4 Apr 2017 10:31:46 +0000 (13:31 +0300)]
IB/core: Add support for fd objects
The completion channel we use in verbs infrastructure is FD based.
Previously, we had a separate way to manage this object. Since we
strive for a single way to manage any kind of object in this
infrastructure, we conceptually treat all objects as subclasses
of ib_uobject.
This commit adds the necessary mechanism to support FD based objects
like their IDR counterparts. FD objects release need to be synchronized
with context release. We use the cleanup_mutex on the uverbs_file for
that.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 4 Apr 2017 10:31:45 +0000 (13:31 +0300)]
IB/core: Add lock to multicast handlers
When two handlers used the same object in the old schema, we blocked
the process in the kernel. The new schema just returns -EBUSY. This
could lead to different behaviour in applications between the old
schema and the new schema. In most cases, using such handlers
concurrently could lead to crashing the process. For example, if
thread A destroys a QP and thread B modifies it, we could have the
destruction happens before the modification. In this case, we are
accessing freed memory which could lead to crashing the process.
This is true for most cases. However, attaching and detaching
a multicast address from QP concurrently is safe. Therefore, we
preserve the original behaviour by adding a lock there.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 4 Apr 2017 10:31:44 +0000 (13:31 +0300)]
IB/core: Change idr objects to use the new schema
This changes only the handlers which deals with idr based objects to
use the new idr allocation, fetching and destruction schema.
This patch consists of the following changes:
(1) Allocation, fetching and destruction is done via idr ops.
(2) Context initializing and release is done through
uverbs_initialize_ucontext and uverbs_cleanup_ucontext.
(3) Ditching the live flag. Mostly, this is pretty straight
forward. The only place that is a bit trickier is in
ib_uverbs_open_qp. Commit [1] added code to check whether
the uobject is already live and initialized. This mostly
happens because of a race between open_qp and events.
We delayed assigning the uobject's pointer in order to
eliminate this race without using the live variable.
[1] commit
a040f95dc819
("IB/core: Fix XRC race condition in ib_uverbs_open_qp")
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 4 Apr 2017 10:31:43 +0000 (13:31 +0300)]
IB/core: Add idr based standard types
This patch adds the standard idr based types. These types are
used in downstream patches in order to initialize, destroy and
lookup IB standard objects which are based on idr objects.
An idr object requires filling out several parameters. Its op pointer
should point to uverbs_idr_ops and its size should be at least the
size of ib_uobject. We add a macro to make the type declaration easier.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 4 Apr 2017 10:31:42 +0000 (13:31 +0300)]
IB/core: Add support for idr types
The new ioctl infrastructure supports driver specific objects.
Each such object type has a hot unplug function, allocation size and
an order of destruction.
When a ucontext is created, a new list is created in this ib_ucontext.
This list contains all objects created under this ib_ucontext.
When a ib_ucontext is destroyed, we traverse this list several time
destroying the various objects by the order mentioned in the object
type description. If few object types have the same destruction order,
they are destroyed in an order opposite to their creation.
Adding an object is done in two parts.
First, an object is allocated and added to idr tree. Then, the
command's handlers (in downstream patches) could work on this object
and fill in its required details.
After a successful command, the commit part is called and the user
objects become ucontext visible. If the handler failed, alloc_abort
should be called.
Removing an uboject is done by calling lookup_get with the write flag
and finalizing it with destroy_commit. A major change from the previous
code is that we actually destroy the kernel object itself in
destroy_commit (rather than just the uobject).
We should make sure idr (per-uverbs-file) and list (per-ucontext) could
be accessed concurrently without corrupting them.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Matan Barak [Tue, 4 Apr 2017 10:31:41 +0000 (13:31 +0300)]
IB/core: Refactor idr to be per uverbs_file
The current code creates an idr per type. Since types are currently
common for all drivers and known in advance, this was good enough.
However, the proposed ioctl based infrastructure allows each driver
to declare only some of the common types and declare its own specific
types.
Thus, we decided to implement idr to be per uverbs_file.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Linus Torvalds [Mon, 20 Mar 2017 02:09:39 +0000 (19:09 -0700)]
Linux 4.11-rc3
Linus Torvalds [Mon, 20 Mar 2017 02:00:47 +0000 (19:00 -0700)]
mm/swap: don't BUG_ON() due to uninitialized swap slot cache
This BUG_ON() triggered for me once at shutdown, and I don't see a
reason for the check. The code correctly checks whether the swap slot
cache is usable or not, so an uninitialized swap slot cache is not
actually problematic afaik.
I've temporarily just switched the BUG_ON() to a WARN_ON_ONCE(), since
I'm not sure why that seemingly pointless check was there. I suspect
the real fix is to just remove it entirely, but for now we'll warn about
it but not bring the machine down.
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Mon, 20 Mar 2017 01:49:28 +0000 (18:49 -0700)]
Merge tag 'powerpc-4.11-5' of git://git./linux/kernel/git/powerpc/linux
Pull more powerpc fixes from Michael Ellerman:
"A couple of minor powerpc fixes for 4.11:
- wire up statx() syscall
- don't print a warning on memory hotplug when HPT resizing isn't
available
Thanks to: David Gibson, Chandan Rajendra"
* tag 'powerpc-4.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/pseries: Don't give a warning when HPT resizing isn't available
powerpc: Wire up statx() syscall
Linus Torvalds [Mon, 20 Mar 2017 01:11:13 +0000 (18:11 -0700)]
Merge branch 'parisc-4.11-2' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
- Mikulas Patocka added support for R_PARISC_SECREL32 relocations in
modules with CONFIG_MODVERSIONS.
- Dave Anglin optimized the cache flushing for vmap ranges.
- Arvind Yadav provided a fix for a potential NULL pointer dereference
in the parisc perf code (and some code cleanups).
- I wired up the new statx system call, fixed some compiler warnings
with the access_ok() macro and fixed shutdown code to really halt a
system at shutdown instead of crashing & rebooting.
* 'parisc-4.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix system shutdown halt
parisc: perf: Fix potential NULL pointer dereference
parisc: Avoid compiler warnings with access_ok()
parisc: Wire up statx system call
parisc: Optimize flush_kernel_vmap_range and invalidate_kernel_vmap_range
parisc: support R_PARISC_SECREL32 relocation in modules
Linus Torvalds [Mon, 20 Mar 2017 01:06:31 +0000 (18:06 -0700)]
Merge git://git./linux/kernel/git/nab/target-pending
Pull SCSI target fixes from Nicholas Bellinger:
"The bulk of the changes are in qla2xxx target driver code to address
various issues found during Cavium/QLogic's internal testing (stable
CC's included), along with a few other stability and smaller
miscellaneous improvements.
There are also a couple of different patch sets from Mike Christie,
which have been a result of his work to use target-core ALUA logic
together with tcm-user backend driver.
Finally, a patch to address some long standing issues with
pass-through SCSI export of TYPE_TAPE + TYPE_MEDIUM_CHANGER devices,
which will make folks using physical (or virtual) magnetic tape happy"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (28 commits)
qla2xxx: Update driver version to 9.00.00.00-k
qla2xxx: Fix delayed response to command for loop mode/direct connect.
qla2xxx: Change scsi host lookup method.
qla2xxx: Add DebugFS node to display Port Database
qla2xxx: Use IOCB interface to submit non-critical MBX.
qla2xxx: Add async new target notification
qla2xxx: Export DIF stats via debugfs
qla2xxx: Improve T10-DIF/PI handling in driver.
qla2xxx: Allow relogin to proceed if remote login did not finish
qla2xxx: Fix sess_lock & hardware_lock lock order problem.
qla2xxx: Fix inadequate lock protection for ABTS.
qla2xxx: Fix request queue corruption.
qla2xxx: Fix memory leak for abts processing
qla2xxx: Allow vref count to timeout on vport delete.
tcmu: Convert cmd_time_out into backend device attribute
tcmu: make cmd timeout configurable
tcmu: add helper to check if dev was configured
target: fix race during implicit transition work flushes
target: allow userspace to set state to transitioning
target: fix ALUA transition timeout handling
...
Linus Torvalds [Sun, 19 Mar 2017 22:45:02 +0000 (15:45 -0700)]
Merge branch 'libnvdimm-fixes' of git://git./linux/kernel/git/nvdimm/nvdimm
Pull device-dax fixes from Dan Williams:
"The device-dax driver was not being careful to handle falling back to
smaller fault-granularity sizes.
The driver already fails fault attempts that are smaller than the
device's alignment, but it also needs to handle the cases where a
larger page mapping could be established. For simplicity of the
immediate fix the implementation just signals VM_FAULT_FALLBACK until
fault-size == device-alignment.
One fix is for -stable to address pmd-to-pte fallback from the
original implementation, another fix is for the new (introduced in
4.11-rc1) pud-to-pmd regression, and a typo fix comes along for the
ride.
These have received a build success notification from the kbuild
robot"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
device-dax: fix debug output typo
device-dax: fix pud fault fallback handling
device-dax: fix pmd/pte fault fallback handling
Himanshu Madhani [Wed, 15 Mar 2017 16:48:56 +0000 (09:48 -0700)]
qla2xxx: Update driver version to 9.00.00.00-k
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
signed-off-by: Giridhar Malavali <giridhar.malavali@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Quinn Tran [Wed, 15 Mar 2017 16:48:55 +0000 (09:48 -0700)]
qla2xxx: Fix delayed response to command for loop mode/direct connect.
Current driver wait for FW to be in the ready state before
processing in-coming commands. For Arbitrated Loop or
Point-to- Point (not switch), FW Ready state can take a while.
FW will transition to ready state after all Nports have been
logged in. In the mean time, certain initiators have completed
the login and starts IO. Driver needs to start processing all
queues if FW is already started.
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Quinn Tran [Wed, 15 Mar 2017 16:48:54 +0000 (09:48 -0700)]
qla2xxx: Change scsi host lookup method.
For target mode, when new scsi command arrive, driver first performs
a look up of the SCSI Host. The current look up method is based on
the ALPA portion of the NPort ID. For Cisco switch, the ALPA can
not be used as the index. Instead, the new search method is based
on the full value of the Nport_ID via btree lib.
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Himanshu Madhani [Wed, 15 Mar 2017 16:48:53 +0000 (09:48 -0700)]
qla2xxx: Add DebugFS node to display Port Database
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Giridhar Malavali <giridhar.malavali@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Quinn Tran [Wed, 15 Mar 2017 16:48:52 +0000 (09:48 -0700)]
qla2xxx: Use IOCB interface to submit non-critical MBX.
The Mailbox interface is currently over subscribed. We like
to reserve the Mailbox interface for the chip managment and
link initialization. Any non essential Mailbox command will
be routed through the IOCB interface. The IOCB interface is
able to absorb more commands.
Following commands are being routed through IOCB interface
- Get ID List (007Ch)
- Get Port DB (0064h)
- Get Link Priv Stats (006Dh)
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Quinn Tran [Wed, 15 Mar 2017 16:48:51 +0000 (09:48 -0700)]
qla2xxx: Add async new target notification
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Anil Gurumurthy [Wed, 15 Mar 2017 16:48:50 +0000 (09:48 -0700)]
qla2xxx: Export DIF stats via debugfs
Signed-off-by: Anil Gurumurthy <anil.gurumurthy@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Quinn Tran [Wed, 15 Mar 2017 16:48:49 +0000 (09:48 -0700)]
qla2xxx: Improve T10-DIF/PI handling in driver.
Add routines to support T10 DIF tag.
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Anil Gurumurthy <anil.gurumurthy@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Quinn Tran [Wed, 15 Mar 2017 16:48:48 +0000 (09:48 -0700)]
qla2xxx: Allow relogin to proceed if remote login did not finish
If the remote port have started the login process, then the
PLOGI and PRLI should be back to back. Driver will allow
the remote port to complete the process. For the case where
the remote port decide to back off from sending PRLI, this
local port sets an expiration timer for the PRLI. Once the
expiration time passes, the relogin retry logic is allowed
to go through and perform login with the remote port.
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Quinn Tran [Wed, 15 Mar 2017 16:48:47 +0000 (09:48 -0700)]
qla2xxx: Fix sess_lock & hardware_lock lock order problem.
The main lock that needs to be held for CMD or TMR submission
to upper layer is the sess_lock. The sess_lock is used to
serialize cmd submission and session deletion. The addition
of hardware_lock being held is not necessary. This patch removes
hardware_lock dependency from CMD/TMR submission.
Use hardware_lock only for error response in this case.
Path1
CPU0 CPU1
---- ----
lock(&(&ha->tgt.sess_lock)->rlock);
lock(&(&ha->hardware_lock)->rlock);
lock(&(&ha->tgt.sess_lock)->rlock);
lock(&(&ha->hardware_lock)->rlock);
Path2/deadlock
*** DEADLOCK ***
Call Trace:
dump_stack+0x85/0xc2
print_circular_bug+0x1e3/0x250
__lock_acquire+0x1425/0x1620
lock_acquire+0xbf/0x210
_raw_spin_lock_irqsave+0x53/0x70
qlt_sess_work_fn+0x21d/0x480 [qla2xxx]
process_one_work+0x1f4/0x6e0
Cc: <stable@vger.kernel.org>
Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reported-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Quinn Tran [Wed, 15 Mar 2017 16:48:46 +0000 (09:48 -0700)]
qla2xxx: Fix inadequate lock protection for ABTS.
Normally, ABTS is sent to Target Core as Task MGMT command.
In the case of error, qla2xxx needs to send response, hardware_lock
is required to prevent request queue corruption.
Cc: <stable@vger.kernel.org>
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Quinn Tran [Wed, 15 Mar 2017 16:48:45 +0000 (09:48 -0700)]
qla2xxx: Fix request queue corruption.
When FW notify driver or driver detects low FW resource,
driver tries to send out Busy SCSI Status to tell Initiator
side to back off. During the send process, the lock was not held.
Cc: <stable@vger.kernel.org>
Signed-off-by: Quinn Tran <quinn.tran@qlogic.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Quinn Tran [Wed, 15 Mar 2017 16:48:44 +0000 (09:48 -0700)]
qla2xxx: Fix memory leak for abts processing
Cc: <stable@vger.kernel.org>
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Joe Carnuccio [Wed, 15 Mar 2017 16:48:43 +0000 (09:48 -0700)]
qla2xxx: Allow vref count to timeout on vport delete.
Cc: <stable@vger.kernel.org>
Signed-off-by: Joe Carnuccio <joe.carnuccio@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Nicholas Bellinger [Sat, 18 Mar 2017 22:04:13 +0000 (15:04 -0700)]
tcmu: Convert cmd_time_out into backend device attribute
Instead of putting cmd_time_out under ../target/core/user_0/foo/control,
which has historically been used by parameters needed for initial
backend device configuration, go ahead and move cmd_time_out into
a backend device attribute.
In order to do this, tcmu_module_init() has been updated to create
a local struct configfs_attribute **tcmu_attrs, that is based upon
the existing passthrough_attrib_attrs along with the new cmd_time_out
attribute. Once **tcm_attrs has been setup, go ahead and point
it at tcmu_ops->tb_dev_attrib_attrs so it's picked up by target-core.
Also following MNC's previous change, ->cmd_time_out is stored in
milliseconds but exposed via configfs in seconds. Also, note this
patch restricts the modification of ->cmd_time_out to before +
after the TCMU device has been configured, but not while it has
active fabric exports.
Cc: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Mike Christie [Thu, 9 Mar 2017 08:42:09 +0000 (02:42 -0600)]
tcmu: make cmd timeout configurable
A single daemon could implement multiple types of devices
using multuple types of real devices that may not support
restarting from crashes and/or handling tcmu timeouts. This
makes the cmd timeout configurable, so handlers that do not
support it can turn if off for now.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Mike Christie [Thu, 9 Mar 2017 08:42:08 +0000 (02:42 -0600)]
tcmu: add helper to check if dev was configured
This adds a helper to check if the dev was configured. It
will be used in the next patch to prevent updates to some
config settings after the device has been setup.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Linus Torvalds [Sat, 18 Mar 2017 22:50:39 +0000 (15:50 -0700)]
Merge tag 'openrisc-for-linus' of git://github.com/openrisc/linux
Pull OpenRISC fixes from Stafford Horne:
"OpenRISC fixes for build issues that were exposed by kbuild robots
after 4.11 merge. All from allmodconfig builds. This includes:
- bug in the handling of 8-byte get_user() calls
- module build failure due to multile missing symbol exports"
* tag 'openrisc-for-linus' of git://github.com/openrisc/linux:
openrisc: Export symbols needed by modules
openrisc: fix issue handling 8 byte get_user calls
openrisc: xchg: fix `computed is not used` warning
Mike Christie [Thu, 2 Mar 2017 10:59:50 +0000 (04:59 -0600)]
target: fix race during implicit transition work flushes
This fixes the following races:
1. core_alua_do_transition_tg_pt could have read
tg_pt_gp_alua_access_state and gone into this if chunk:
if (!explicit &&
atomic_read(&tg_pt_gp->tg_pt_gp_alua_access_state) ==
ALUA_ACCESS_STATE_TRANSITION) {
and then core_alua_do_transition_tg_pt_work could update the
state. core_alua_do_transition_tg_pt would then only set
tg_pt_gp_alua_pending_state and the tg_pt_gp_alua_access_state would
not get updated with the second calls state.
2. core_alua_do_transition_tg_pt could be setting
tg_pt_gp_transition_complete while the tg_pt_gp_transition_work
is already completing. core_alua_do_transition_tg_pt then waits on the
completion that will never be called.
To handle these issues, we just call flush_work which will return when
core_alua_do_transition_tg_pt_work has completed so there is no need
to do the complete/wait. And, if core_alua_do_transition_tg_pt_work
was running, instead of trying to sneak in the state change, we just
schedule up another core_alua_do_transition_tg_pt_work call.
Note that this does not handle a possible race where there are multiple
threads call core_alua_do_transition_tg_pt at the same time. I think
we need a mutex in target_tg_pt_gp_alua_access_state_store.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Mike Christie [Thu, 2 Mar 2017 10:59:49 +0000 (04:59 -0600)]
target: allow userspace to set state to transitioning
Userspace target_core_user handlers like tcmu-runner may want to set the
ALUA state to transitioning while it does implicit transitions. This
patch allows that state when set from configfs.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Mike Christie [Thu, 2 Mar 2017 10:59:48 +0000 (04:59 -0600)]
target: fix ALUA transition timeout handling
The implicit transition time tells initiators the min time
to wait before timing out a transition. We currently schedule
the transition to occur in tg_pt_gp_implicit_trans_secs
seconds so there is no room for delays. If
core_alua_do_transition_tg_pt_work->core_alua_update_tpg_primary_metadata
needs to write out info to a remote file, then the initiator can
easily time out the operation.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Mike Christie [Thu, 2 Mar 2017 05:13:26 +0000 (23:13 -0600)]
target: Use system workqueue for ALUA transitions
If tcmu-runner is processing a STPG and needs to change the kernel's
ALUA state then we cannot use the same work queue for task management
requests and ALUA transitions, because we could deadlock. The problem
occurs when a STPG times out before tcmu-runner is able to
call into target_tg_pt_gp_alua_access_state_store->
core_alua_do_port_transition -> core_alua_do_transition_tg_pt ->
queue_work. In this case, the tmr is on the work queue waiting for
the STPG to complete, but the STPG transition is now queued behind
the waiting tmr.
Note:
This bug will also be fixed by this patch:
http://www.spinics.net/lists/target-devel/msg14560.html
which switches the tmr code to use the system workqueues.
For both, I am not sure if we need a dedicated workqueue since
it is not a performance path and I do not think we need WQ_MEM_RECLAIM
to make forward progress to free up memory like the block layer does.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Mike Christie [Thu, 2 Mar 2017 05:13:25 +0000 (23:13 -0600)]
target: fail ALUA transitions for pscsi
We do not setup the LU group for pscsi devices, so if you write
a state to alua_access_state that will cause a transition you will
get a NULL pointer dereference.
This patch will fail attempts to try and transition the path
for backend devices that set the TRANSPORT_FLAG_PASSTHROUGH_ALUA
flag.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Mike Christie [Thu, 2 Mar 2017 05:13:24 +0000 (23:13 -0600)]
target: allow ALUA setup for some passthrough backends
This patch allows passthrough backends to use the core/base LIO
ALUA setup and state checks, but still handle the execution of
commands.
This will allow the target_core_user module to execute STPG and RTPG
in userspace, and not have to duplicate the ALUA state checks, path
information (needed so we can check if command is executable on
specific paths) and setup (rtslib sets/updates the configfs ALUA
interface like it does for iblock or file).
For STPG, the target_core_user userspace daemon, tcmu-runner will
still execute the STPG, and to update the core/base LIO state it
will use the existing configfs interface. For RTPG, tcmu-runner
will loop over configfs and/or cache the state.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Mike Christie [Thu, 2 Mar 2017 05:14:40 +0000 (23:14 -0600)]
tcmu: return on first Opt parse failure
We only were returing failure if the last opt to be parsed failed.
This has a return failure when we first detect a failure.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Mike Christie [Thu, 2 Mar 2017 05:14:39 +0000 (23:14 -0600)]
tcmu: allow hw_max_sectors greater than 128
tcmu hard codes the hw_max_sectors to 128 which is a litle small.
Userspace uses the max_sectors to report the optimal IO size and
some initiators perform better with larger IOs (open-iscsi seems
to do better with 256 to 512 depending on the test).
(Fix do not display hw max sectors twice - MNC)
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Nicholas Bellinger [Wed, 8 Mar 2017 08:09:59 +0000 (00:09 -0800)]
target: Drop pointless tfo->check_stop_free check
All in-tree fabric drivers provide a tfo->check_stop_free(),
so there is no need to do the extra check within existing
transport_cmd_check_stop_to_fabric() code.
Just to be sure, add a check in target_fabric_tf_ops_check()
to notify any out-of-tree drivers that might be missing it.
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Helge Deller [Sat, 18 Mar 2017 16:13:27 +0000 (17:13 +0100)]
parisc: Fix system shutdown halt
On those parisc machines which don't provide a software power off
function, the system currently kills the init process at the end of a
shutdown and unexpectedly restarts insteads of halting.
Fix it by adding a loop which will not return.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # 4.9+
Arvind Yadav [Tue, 14 Mar 2017 09:54:51 +0000 (15:24 +0530)]
parisc: perf: Fix potential NULL pointer dereference
Fix potential NULL pointer dereference and clean up
coding style errors (code indent, trailing whitespaces).
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Helge Deller <deller@gmx.de>
Linus Torvalds [Sat, 18 Mar 2017 15:33:44 +0000 (08:33 -0700)]
Merge branch 'smp-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull CPU hotplug fix from Thomas Gleixner:
"A single fix preventing the concurrent execution of the CPU hotplug
callback install/invocation machinery. Long standing bug caused by a
massive brain slip of that Gleixner dude, which went unnoticed for
almost a year"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Serialize callback invocations proper
Linus Torvalds [Sat, 18 Mar 2017 00:25:14 +0000 (17:25 -0700)]
Merge tag 'pm-4.11-rc3' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a few more intel_pstate issues and one small issue in the
cpufreq core.
Specifics:
- Fix breakage in the intel_pstate's debugfs interface for PID
controller tuning (Rafael Wysocki)
- Fix computations related to P-state limits in intel_pstate to avoid
excessive rounding errors leading to visible inaccuracies (Srinivas
Pandruvada, Rafael Wysocki)
- Add a missing newline to a message printed by one function in the
cpufreq core and clean up that function (Rafael Wysocki)"
* tag 'pm-4.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Fix and clean up show_cpuinfo_cur_freq()
cpufreq: intel_pstate: Avoid percentages in limits-related computations
cpufreq: intel_pstate: Correct frequency setting in the HWP mode
cpufreq: intel_pstate: Update pid_params.sample_rate_ns in pid_param_set()
Rafael J. Wysocki [Fri, 17 Mar 2017 23:45:09 +0000 (00:45 +0100)]
Merge branches 'pm-cpufreq-fixes' and 'intel_pstate-fixes'
* pm-cpufreq-fixes:
cpufreq: Fix and clean up show_cpuinfo_cur_freq()
* intel_pstate-fixes:
cpufreq: intel_pstate: Avoid percentages in limits-related computations
cpufreq: intel_pstate: Correct frequency setting in the HWP mode
cpufreq: intel_pstate: Update pid_params.sample_rate_ns in pid_param_set()
Linus Torvalds [Fri, 17 Mar 2017 21:16:22 +0000 (14:16 -0700)]
Merge tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"We have a handful of stable fixes to fix kernel warnings and other
bugs that have been around for a while. We've also found a few other
reference counting bugs and memory leaks since the initial 4.11 pull.
Stable Bugfixes:
- Fix decrementing nrequests in NFS v4.2 COPY to fix kernel warnings
- Prevent a double free in async nfs4_exchange_id()
- Squelch a kbuild sparse complaint for xprtrdma
Other Bugfixes:
- Fix a typo (NFS_ATTR_FATTR_GROUP_NAME) that causes a memory leak
- Fix a reference leak that causes kernel warnings
- Make nfs4_cb_sv_ops static to fix a sparse warning
- Respect a server's max size in CREATE_SESSION
- Handle errors from nfs4_pnfs_ds_connect
- Flexfiles layout shouldn't mark devices as unavailable"
* tag 'nfs-for-4.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
pNFS/flexfiles: never nfs4_mark_deviceid_unavailable
pNFS: return status from nfs4_pnfs_ds_connect
NFSv4.1 respect server's max size in CREATE_SESSION
NFS prevent double free in async nfs4_exchange_id
nfs: make nfs4_cb_sv_ops static
xprtrdma: Squelch kbuild sparse complaint
NFS: fix the fault nrequests decreasing for nfs_inode COPY
NFSv4: fix a reference leak caused WARNING messages
nfs4: fix a typo of NFS_ATTR_FATTR_GROUP_NAME
Linus Torvalds [Fri, 17 Mar 2017 21:05:03 +0000 (14:05 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"An assorted pile of fixes along with some hardware enablement:
- a fix for a KASAN / branch profiling related boot failure
- some more fallout of the PUD rework
- a fix for the Always Running Timer which is not initialized when
the TSC frequency is known at boot time (via MSR/CPUID)
- a resource leak fix for the RDT filesystem
- another unwinder corner case fixup
- removal of the warning for duplicate NMI handlers because there are
legitimate cases where more than one handler can be registered at
the last level
- make a function static - found by sparse
- a set of updates for the Intel MID platform which got delayed due
to merge ordering constraints. It's hardware enablement for a non
mainstream platform, so there is no risk"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mpx: Make unnecessarily global function static
x86/intel_rdt: Put group node in rdtgroup_kn_unlock
x86/unwind: Fix last frame check for aligned function stacks
mm, x86: Fix native_pud_clear build error
x86/kasan: Fix boot with KASAN=y and PROFILE_ANNOTATED_BRANCHES=y
x86/platform/intel-mid: Add power button support for Merrifield
x86/platform/intel-mid: Use common power off sequence
x86/platform: Remove warning message for duplicate NMI handlers
x86/tsc: Fix ART for TSC_KNOWN_FREQ
x86/platform/intel-mid: Correct MSI IRQ line for watchdog device
Linus Torvalds [Fri, 17 Mar 2017 21:01:40 +0000 (14:01 -0700)]
Merge branch 'x86-acpi-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 acpi fixes from Thomas Gleixner:
"This update deals with the fallout of the recent work to make
cpuid/node mappings persistent.
It turned out that the boot time ACPI based mapping tripped over ACPI
inconsistencies and caused regressions. It's partially reverted and
the fragile part replaced by an implementation which makes the mapping
persistent when a CPU goes online for the first time"
* 'x86-acpi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
acpi/processor: Check for duplicate processor ids at hotplug time
acpi/processor: Implement DEVICE operator for processor enumeration
x86/acpi: Restore the order of CPU IDs
Revert"x86/acpi: Enable MADT APIs to return disabled apicids"
Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting"
Linus Torvalds [Fri, 17 Mar 2017 20:59:52 +0000 (13:59 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A set of perf related fixes:
- fix a CR4.PCE propagation issue caused by usage of mm instead of
active_mm and therefore propagated the wrong value.
- perf core fixes, which plug a use-after-free issue and make the
event inheritance on fork more robust.
- a tooling fix for symbol handling"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf symbols: Fix symbols__fixup_end heuristic for corner cases
x86/perf: Clarify why x86_pmu_event_mapped() isn't racy
x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
perf/core: Better explain the inherit magic
perf/core: Simplify perf_event_free_task()
perf/core: Fix event inheritance on fork()
perf/core: Fix use-after-free in perf_release()
Linus Torvalds [Fri, 17 Mar 2017 20:19:07 +0000 (13:19 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
"From the scheduler departement:
- a bunch of sched deadline related fixes which deal with various
buglets and corner cases.
- two fixes for the loadavg spikes which are caused by the delayed
NOHZ accounting"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Use deadline instead of period when calculating overflow
sched/deadline: Throttle a constrained deadline task activated after the deadline
sched/deadline: Make sure the replenishment timer fires in the next period
sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
sched/deadline: Add missing update_rq_clock() in dl_task_timer()
Linus Torvalds [Fri, 17 Mar 2017 20:16:24 +0000 (13:16 -0700)]
Merge branch 'locking-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Three fixes related to locking:
- fix a SIGKILL issue for RWSEM_GENERIC_SPINLOCK which has been fixed
for the XCHGADD variant already
- plug a potential use after free in the futex code
- prevent leaking a held spinlock in an futex error handling code
path"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y
futex: Add missing error handling to FUTEX_REQUEUE_PI
futex: Fix potential use-after-free in FUTEX_REQUEUE_PI
Linus Torvalds [Fri, 17 Mar 2017 20:13:35 +0000 (13:13 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"Just a simple revert of a new sched_clock implementation which turned
out to be buggy"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "clocksource/drivers/tcb_clksrc: Use 32 bit tcb as sched_clock"
Weston Andros Adamson [Thu, 9 Mar 2017 17:56:49 +0000 (12:56 -0500)]
pNFS/flexfiles: never nfs4_mark_deviceid_unavailable
The flexfiles layout should never mark a device unavailable.
Move nfs4_mark_deviceid_unavailable out of nfs4_pnfs_ds_connect and call
directly from files layout where it's still needed.
The flexfiles driver still handles marked devices in error paths, but will
now print a rate limited warning.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Weston Andros Adamson [Thu, 9 Mar 2017 17:56:48 +0000 (12:56 -0500)]
pNFS: return status from nfs4_pnfs_ds_connect
The nfs4_pnfs_ds_connect path can call rpc_create which can fail or it
can wait on another context to reach the same failure.
This checks that the rpc_create succeeded and returns the error to the
caller.
When an error is returned, both the files and flexfiles layouts will return
NULL from _prepare_ds(). The flexfiles layout will also return the layout
with the error NFS4ERR_NXIO.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Olga Kornievskaia [Wed, 8 Mar 2017 19:39:15 +0000 (14:39 -0500)]
NFSv4.1 respect server's max size in CREATE_SESSION
Currently client doesn't respect max sizes server returns in CREATE_SESSION.
nfs4_session_set_rwsize() gets called and server->rsize, server->wsize are 0
so they never get set to the sizes returned by the server.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Olga Kornievskaia [Mon, 13 Mar 2017 14:36:19 +0000 (10:36 -0400)]
NFS prevent double free in async nfs4_exchange_id
Since rpc_task is async, the release function should be called which
will free the impl_id, scope, and owner.
Trond pointed at 2 more problems:
-- use of client pointer after free in the nfs4_exchangeid_release() function
-- cl_count mismatch if rpc_run_task() isn't run
Fixes: 8d89bd70bc9 ("NFS setup async exchange_id")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # 4.9
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Jason Yan [Fri, 10 Mar 2017 02:48:13 +0000 (10:48 +0800)]
nfs: make nfs4_cb_sv_ops static
Fixes the following sparse warning:
fs/nfs/callback.c:235:21: warning: symbol 'nfs4_cb_sv_ops' was not
declared. Should it be static?
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Sat, 11 Mar 2017 20:52:47 +0000 (15:52 -0500)]
xprtrdma: Squelch kbuild sparse complaint
New complaint from kbuild for 4.9.y:
net/sunrpc/xprtrdma/verbs.c:489:19: sparse: incompatible types in
comparison expression (different type sizes)
verbs.c:
489 max_sge = min(ia->ri_device->attrs.max_sge, RPCRDMA_MAX_SEND_SGES);
I can't reproduce this running sparse here. Likewise, "make W=1
net/sunrpc/xprtrdma/verbs.o" never indicated any issue.
A little poking suggests that because the range of its values is
small, gcc can make the actual width of RPCRDMA_MAX_SEND_SGES
smaller than the width of an unsigned integer.
Fixes: 16f906d66cd7 ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Kinglong Mee [Thu, 9 Mar 2017 03:36:36 +0000 (11:36 +0800)]
NFS: fix the fault nrequests decreasing for nfs_inode COPY
The nfs_commit_file for NFSv4.2's COPY operation goes through
the commit path for normal WRITE, but without increase nrequests,
so, the nrequests decreased in nfs_commit_release_pages is fault.
After that, the nrequests will be wrong.
[ 5670.299881] ------------[ cut here ]------------
[ 5670.300295] WARNING: CPU: 0 PID: 27656 at fs/nfs/inode.c:127 nfs_clear_inode+0x66/0x90 [nfs]
[ 5670.300558] Modules linked in: nfsv4(E) nfs(E) fscache(E) tun bridge stp llc fuse ip_set nfnetlink vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event ppdev f2fs coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_ens1371 intel_rapl_perf gameport snd_ac97_codec vmw_balloon ac97_bus snd_seq snd_pcm joydev snd_rawmidi snd_timer snd_seq_device snd soundcore nfit parport_pc parport acpi_cpufreq tpm_tis tpm_tis_core tpm i2c_piix4 vmw_vmci shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c vmwgfx drm_kms_helper ttm drm e1000 crc32c_intel mptspi scsi_transport_spi serio_raw mptscsih mptbase ata_generic pata_acpi fjes [last unloaded: fscache]
[ 5670.302925] CPU: 0 PID: 27656 Comm: umount.nfs4 Tainted: G W E 4.11.0-rc1+ #519
[ 5670.303292] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 5670.304094] Call Trace:
[ 5670.304510] dump_stack+0x63/0x86
[ 5670.304917] __warn+0xcb/0xf0
[ 5670.305276] warn_slowpath_null+0x1d/0x20
[ 5670.305661] nfs_clear_inode+0x66/0x90 [nfs]
[ 5670.306093] nfs4_evict_inode+0x61/0x70 [nfsv4]
[ 5670.306480] evict+0xbb/0x1c0
[ 5670.306888] dispose_list+0x4d/0x70
[ 5670.307233] evict_inodes+0x178/0x1a0
[ 5670.307579] generic_shutdown_super+0x44/0xf0
[ 5670.307985] nfs_kill_super+0x21/0x40 [nfs]
[ 5670.308325] deactivate_locked_super+0x43/0x70
[ 5670.308698] deactivate_super+0x5a/0x60
[ 5670.309036] cleanup_mnt+0x3f/0x90
[ 5670.309407] __cleanup_mnt+0x12/0x20
[ 5670.309837] task_work_run+0x80/0xa0
[ 5670.310162] exit_to_usermode_loop+0x89/0x90
[ 5670.310497] syscall_return_slowpath+0xaa/0xb0
[ 5670.310875] entry_SYSCALL_64_fastpath+0xa7/0xa9
[ 5670.311197] RIP: 0033:0x7f1bb3617fe7
[ 5670.311545] RSP: 002b:
00007ffecbabb828 EFLAGS:
00000206 ORIG_RAX:
00000000000000a6
[ 5670.311906] RAX:
0000000000000000 RBX:
0000000001dca1f0 RCX:
00007f1bb3617fe7
[ 5670.312239] RDX:
000000000000000c RSI:
0000000000000001 RDI:
0000000001dc83c0
[ 5670.312653] RBP:
0000000001dc83c0 R08:
0000000000000001 R09:
0000000000000000
[ 5670.312998] R10:
0000000000000755 R11:
0000000000000206 R12:
00007ffecbabc66a
[ 5670.313335] R13:
0000000001dc83a0 R14:
0000000000000000 R15:
0000000000000000
[ 5670.313758] ---[ end trace
bf4bfe7764e4eb40 ]---
Cc: linux-kernel@vger.kernel.org
Fixes: 67911c8f18 ("NFS: Add nfs_commit_file()")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Linus Torvalds [Fri, 17 Mar 2017 19:16:44 +0000 (12:16 -0700)]
Merge tag 'afs-
20170316' of git://git./linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"Fixes to the AFS filesystem in the kernel.
They fix a variety of bugs. These include some issues fixed for
consistency with other AFS implementations:
- handle AFS mode bits better
- use the client mtime rather than the server mtime in the protocol
- handle the server returning more or less data than was requested in
a FetchData call
- distinguish mountpoints from symlinks based on the mode bits rather
than preemptively reading every symlink to find out what it
actually represents
One other notable change for the user is that files are now flushed on
close analogously with other network filesystems"
* tag 'afs-
20170316' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (28 commits)
afs: Don't wait for page writeback with the page lock held
afs: ->writepage() shouldn't call clear_page_dirty_for_io()
afs: Fix abort on signal while waiting for call completion
afs: Fix an off-by-one error in afs_send_pages()
afs: Fix afs_kill_pages()
afs: Fix page leak in afs_write_begin()
afs: Don't set PG_error on local EINTR or ENOMEM when filling a page
afs: Populate and use client modification time
afs: Better abort and net error handling
afs: Invalid op ID should abort with RXGEN_OPCODE
afs: Fix the maths in afs_fs_store_data()
afs: Use a bvec rather than a kvec in afs_send_pages()
afs: Make struct afs_read::remain 64-bit
afs: Fix AFS read bug
afs: Prevent callback expiry timer overflow
afs: Migrate vlocation fields to 64-bit
afs: security: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
afs: inode: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
afs: Distinguish mountpoints from symlinks by file mode alone
afs: Flush outstanding writes when an fd is closed
...
Linus Torvalds [Fri, 17 Mar 2017 19:14:49 +0000 (12:14 -0700)]
Merge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fix from Russell King:
"Just one change to add the statx syscall this time around"
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: wire up statx syscall
Linus Torvalds [Fri, 17 Mar 2017 18:25:46 +0000 (11:25 -0700)]
Merge tag 'for-linus-4.11b-rc3-tag' of git://git./linux/kernel/git/xen/tip
Pull xen fix from Juergen Gross:
"A minor fix for using the appropriate refcount_t instead of atomic_t"
* tag 'for-linus-4.11b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
drivers, xen: convert grant_map.users from atomic_t to refcount_t
Linus Torvalds [Fri, 17 Mar 2017 18:19:52 +0000 (11:19 -0700)]
Merge tag 'drm-fixes-for-v4.11-rc3' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Bunch of fixes across the drivers, in a St Patrick's day pull request
(please turn terminal colors to green on black or black on green for
full effect).
On the arm side, tilcdc, omap and malidp got fixes, while amd has some
powermanagement fixes, and intel has a set of fixes across the driver.
Nothing seems to bad or scary at this point"
* tag 'drm-fixes-for-v4.11-rc3' of git://people.freedesktop.org/~airlied/linux: (27 commits)
drm/amd/amdgpu: Fix debugfs reg read/write address width
drm/amdgpu/si: add dpm quirk for Oland
drm/radeon/si: add dpm quirk for Oland
drm: amd: remove broken include path
drm/amd/powerplay: fix copy error in smu7_clockpoweragting.c
drm/tilcdc: Set framebuffer DMA address to HW only if CRTC is enabled
drm/tilcdc: Fix hardcoded fail-return value in tilcdc_crtc_create()
drm/i915: Fix forcewake active domain tracking
drm/i915: Nuke skl_update_plane debug message from the pipe update critical section
drm/i915: use correct node for handling cache domain eviction
uapi: fix drm/omap_drm.h userspace compilation errors
drm/omap: fix dmabuf mmap for dma_alloc'ed buffers
drm/amdgpu: fix parser init error path to avoid crash in parser fini
drm/amd/amdgpu: Disable GFX_PG on Carrizo until compute issues solved
drm: mali-dp: Fix smart layer not going to composition
drm: mali-dp: Remove mclk rate management
drm/i915: Drain the freed state from the tail of the next commit
drm/i915: Nuke debug messages from the pipe update critical section
drm/i915: Use pagecache write to prepopulate shmemfs from pwrite-ioctl
drm/i915: Store a permanent error in obj->mm.pages
...
Ingo Molnar [Fri, 17 Mar 2017 14:13:28 +0000 (15:13 +0100)]
Merge tag 'perf-urgent-for-mingo-4.11-
20170317' of git://git./linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fix from Arnaldo Carvalho de Melo:
- Fix symbols__fixup_end heuristic for corner cases, such as JITted eBPF
programs, that are loaded at page aligned addresses, just after the
kernel proper (Daniel Borkmann)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Borkmann [Wed, 15 Mar 2017 21:53:37 +0000 (22:53 +0100)]
perf symbols: Fix symbols__fixup_end heuristic for corner cases
The current symbols__fixup_end() heuristic for the last entry in the rb
tree is suboptimal as it leads to not being able to recognize the symbol
in the call graph in a couple of corner cases, for example:
i) If the symbol has a start address (f.e. exposed via kallsyms)
that is at a page boundary, then the roundup(curr->start, 4096)
for the last entry will result in curr->start == curr->end with
a symbol length of zero.
ii) If the symbol has a start address that is shortly before a page
boundary, then also here, curr->end - curr->start will just be
very few bytes, where it's unrealistic that we could perform a
match against.
Instead, change the heuristic to roundup(curr->start, 4096) + 4096, so
that we can catch such corner cases and have a better chance to find
that specific symbol. It's still just best effort as the real end of the
symbol is unknown to us (and could even be at a larger offset than the
current range), but better than the current situation.
Alexei reported that he recently run into case i) with a JITed eBPF
program (these are all page aligned) as the last symbol which wasn't
properly shown in the call graph (while other eBPF program symbols in
the rb tree were displayed correctly). Since this is a generic issue,
lets try to improve the heuristic a bit.
Reported-and-Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Fixes: 2e538c4a1847 ("perf tools: Improve kernel/modules symbol lookup")
Link: http://lkml.kernel.org/r/bb5c80d27743be6f12afc68405f1956a330e1bc9.1489614365.git.daniel@iogearbox.net
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Andy Lutomirski [Thu, 16 Mar 2017 19:59:40 +0000 (12:59 -0700)]
x86/perf: Clarify why x86_pmu_event_mapped() isn't racy
Naively, it looks racy, but ->mmap_sem saves it. Add a comment and a
lockdep assertion.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/03a1e629063899168dfc4707f3bb6e581e21f5c6.1489694270.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy Lutomirski [Thu, 16 Mar 2017 19:59:39 +0000 (12:59 -0700)]
x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
If one thread mmaps a perf event while another thread in the same mm
is in some context where active_mm != mm (which can happen in the
scheduler, for example), refresh_pce() would write the wrong value
to CR4.PCE. This broke some PAPI tests.
Reported-and-tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: 7911d3f7af14 ("perf/x86: Only allow rdpmc if a perf_event is mapped")
Link: http://lkml.kernel.org/r/0c5b38a76ea50e405f9abe07a13dfaef87c173a1.1489694270.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Michael Ellerman [Fri, 17 Mar 2017 05:02:35 +0000 (16:02 +1100)]
powerpc/pseries: Don't give a warning when HPT resizing isn't available
As of commit
438cc81a41e8 ("powerpc/pseries: Automatically resize HPT
for memory hot add/remove"), when running on the pseries platform, we
always attempt to use the PAPR extension to resize the hashed page
table (HPT) when we add or remove memory.
This is fine, but when the extension is not available we'll give a
harmless, but scary warning. Instead check if the firmware supports HPT
resizing before populating the mmu_hash_ops.resize_hpt pointer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Linus Torvalds [Fri, 17 Mar 2017 01:23:02 +0000 (18:23 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
"6 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
drivers core: remove assert_held_device_hotplug()
mm: add private lock to serialize memory hotplug operations
mm: don't warn when vmalloc() fails due to a fatal signal
mm, x86: fix native_pud_clear build error
kasan: add a prototype of task_struct to avoid warning
z3fold: fix spinlock unlocking in page reclaim
Heiko Carstens [Thu, 16 Mar 2017 23:40:33 +0000 (16:40 -0700)]
drivers core: remove assert_held_device_hotplug()
The last caller of assert_held_device_hotplug() is gone, so remove it again.
Link: http://lkml.kernel.org/r/20170314125226.16779-3-heiko.carstens@de.ibm.com
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Heiko Carstens [Thu, 16 Mar 2017 23:40:30 +0000 (16:40 -0700)]
mm: add private lock to serialize memory hotplug operations
Commit
bfc8c90139eb ("mem-hotplug: implement get/put_online_mems")
introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
in order to allow similar semantics for memory hotplug like for cpu
hotplug.
The corresponding functions for cpu hotplug are get/put_online_cpus()
and cpu_hotplug_begin/done() for cpu hotplug.
The commit however missed to introduce functions that would serialize
memory hotplug operations like they are done for cpu hotplug with
cpu_maps_update_begin/done().
This basically leaves mem_hotplug.active_writer unprotected and allows
concurrent writers to modify it, which may lead to problems as outlined
by commit
f931ab479dd2 ("mm: fix devm_memremap_pages crash, use
mem_hotplug_{begin, done}").
That commit was extended again with commit
b5d24fda9c3d ("mm,
devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
done}") which serializes memory hotplug operations for some call sites
by using the device_hotplug lock.
In addition with commit
3fc21924100b ("mm: validate device_hotplug is held
for memory hotplug") a sanity check was added to mem_hotplug_begin() to
verify that the device_hotplug lock is held.
This in turn triggers the following warning on s390:
WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
Call Trace:
assert_held_device_hotplug+0x40/0x58)
mem_hotplug_begin+0x34/0xc8
add_memory_resource+0x7e/0x1f8
add_memory+0xda/0x130
add_memory_merged+0x15c/0x178
sclp_detect_standby_memory+0x2ae/0x2f8
do_one_initcall+0xa2/0x150
kernel_init_freeable+0x228/0x2d8
kernel_init+0x2a/0x140
kernel_thread_starter+0x6/0xc
One possible fix would be to add more lock_device_hotplug() and
unlock_device_hotplug() calls around each call site of
mem_hotplug_begin/end(). But that would give the device_hotplug lock
additional semantics it better should not have (serialize memory hotplug
operations).
Instead add a new memory_add_remove_lock which has the similar semantics
like cpu_add_remove_lock for cpu hotplug.
To keep things hopefully a bit easier the lock will be locked and unlocked
within the mem_hotplug_begin/end() functions.
Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitry Vyukov [Thu, 16 Mar 2017 23:40:27 +0000 (16:40 -0700)]
mm: don't warn when vmalloc() fails due to a fatal signal
When vmalloc() fails it prints a very lengthy message with all the
details about memory consumption assuming that it happened due to OOM.
However, vmalloc() can also fail due to fatal signal pending. In such
case the message is quite confusing because it suggests that it is OOM
but the numbers suggest otherwise. The messages can also pollute
console considerably.
Don't warn when vmalloc() fails due to fatal signal pending.
Link: http://lkml.kernel.org/r/20170313114425.72724-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Thu, 16 Mar 2017 23:40:24 +0000 (16:40 -0700)]
mm, x86: fix native_pud_clear build error
We still get a build error in random configurations, after this has been
modified a few times:
In file included from include/linux/mm.h:68:0,
from include/linux/suspend.h:8,
from arch/x86/kernel/asm-offsets.c:12:
arch/x86/include/asm/pgtable.h:66:26: error: redefinition of 'native_pud_clear'
#define pud_clear(pud) native_pud_clear(pud)
My interpretation is that the build error comes from a typo in
__PAGETABLE_PUD_FOLDED, so fix that typo now, and remove the incorrect
#ifdef around the native_pud_clear definition.
Fixes: 3e761a42e19c ("mm, x86: fix HIGHMEM64 && PARAVIRT build config for native_pud_clear()")
Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
Link: http://lkml.kernel.org/r/20170314121330.182155-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Ackedy-by: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Masami Hiramatsu [Thu, 16 Mar 2017 23:40:21 +0000 (16:40 -0700)]
kasan: add a prototype of task_struct to avoid warning
Add a prototype of task_struct to fix below warning on arm64.
In file included from arch/arm64/kernel/probes/kprobes.c:19:0:
include/linux/kasan.h:81:132: error: 'struct task_struct' declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
static inline void kasan_unpoison_task_stack(struct task_struct *task) {}
As same as other types (kmem_cache, page, and vm_struct) this adds a
prototype of task_struct data structure on top of kasan.h.
[arnd] A related warning was fixed before, but now appears in a
different line in the same file in v4.11-rc2. The patch from Masami
Hiramatsu still seems appropriate, so let's take his version.
Fixes: 71af2ed5eeea ("kasan, sched/headers: Remove <linux/sched.h> from <linux/kasan.h>")
Link: https://patchwork.kernel.org/patch/9569839/
Link: http://lkml.kernel.org/r/20170313141517.3397802-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Alexander Potapenko <glider@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vitaly Wool [Thu, 16 Mar 2017 23:40:19 +0000 (16:40 -0700)]
z3fold: fix spinlock unlocking in page reclaim
Commmit
5a27aa822029 ("z3fold: add kref refcounting") introduced a bug
in z3fold_reclaim_page() with function exit that may leave pool->lock
spinlock held. Here comes the trivial fix.
Fixes: 5a27aa822029 ("z3fold: add kref refcounting")
Link: http://lkml.kernel.org/r/20170311222239.7b83d8e7ef1914e05497649f@gmail.com
Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>