Tejun Heo [Tue, 18 Aug 2015 21:55:36 +0000 (14:55 -0700)]
blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
cgroup is trying to make interface consistent across different
controllers. For weight based resource control, the knob should have
the range [1, 10000] and default to 100. This patch updates
cfq-iosched so that the weight range conforms. The internal
calculations have enough range and the widening of the weight range
shouldn't cause any problem.
* blkcg_policy->cpd_bind_fn() is added. If present, this is invoked
when blkcg is attached to a hierarchy.
* cfq_cpd_init() is updated to use the new default value on the
unified hierarchy.
* cfq_cpd_bind() callback is implemented to clear per-blkg configs and
apply the default config matching the hierarchy type.
* cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is
now responsible for initializing the initial weights when blkcg is
enabled.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:35 +0000 (14:55 -0700)]
blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
blkcg is gonna switch to cgroup common weight range as defined by
CGROUP_WEIGHT_* on the unified hierarchy. In preparation, rename
CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:34 +0000 (14:55 -0700)]
blkcg: implement interface for the unified hierarchy
blkcg interface grew to be the biggest of all controllers and
unfortunately most inconsistent too. The interface files are
inconsistent with a number of cloes duplicates. Some files have
recursive variants while others don't. There's distinction between
normal and leaf weights which isn't intuitive and there are a lot of
stat knobs which don't make much sense outside of debugging and expose
too much implementation details to userland.
In the unified hierarchy, everything is always hierarchical and
internal nodes can't have tasks rendering the two structural issues
twisting the current interface. The interface has to be updated in a
significant anyway and this is a good chance to revamp it as a whole.
This patch implements blkcg interface for the unified hierarchy.
* (from a previous patch) blkcg is identified by "io" instead of
"blkio" on the unified hierarchy. Given that the whole interface is
updated anyway, the rename shouldn't carry noticeable conversion
overhead.
* The original interface consisted of 27 files is replaced with the
following three files.
blkio.stat : per-blkcg stats
blkio.weight : per-cgroup and per-cgroup-queue weight settings
blkio.max : per-cgroup-queue bps and iops max limits
Documentation/cgroups/unified-hierarchy.txt updated accordingly.
v2: blkcg_policy->dfl_cftypes wasn't removed on
blkcg_policy_unregister() corrupting the cftypes list. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:33 +0000 (14:55 -0700)]
blkcg: misc preparations for unified hierarchy interface
* Export blkg_dev_name()
* Drop unnecessary @cft from __cfq_set_weight().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:32 +0000 (14:55 -0700)]
blkcg: separate out tg_conf_updated() from tg_set_conf()
tg_set_conf() is largely consisted of parsing and setting the new
config and the follow-up application and propagation. This patch
separates out the latter part into tg_conf_updated(). This will be
used to implement interface for the unified hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:31 +0000 (14:55 -0700)]
blkcg: move body parsing from blkg_conf_prep() to its callers
Currently, blkg_conf_prep() expects input to be of the following form
MAJ:MIN NUM
and reads the NUM part into blkg_conf_ctx->v. This is quite
restrictive and gets in the way in implementing blkcg interface for
the unified hierarchy. This patch updates blkg_conf_prep() so that it
expects
MAJ:MIN BODY_STR
where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
with ->body which is a char pointer pointing to the start of BODY_STR.
Parsing of the body is moved to blkg_conf_prep()'s callers.
To allow using, for example, strsep() on blkg_conf_ctx->val, it is a
non-const pointer and to accommodate that const is dropped from @input
too.
This doesn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:30 +0000 (14:55 -0700)]
blkcg: mark existing cftypes as legacy
blkcg is about to grow interface for the unified hierarchy. Add
legacy to existing cftypes.
* blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
* blk-cgroup.c:blkcg_files -> blkcg_legacy_files
* cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
* blk-throttle.c:throtl_files -> throtl_legacy_files
Pure renames. No functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:29 +0000 (14:55 -0700)]
blkcg: rename subsystem name from blkio to io
blkio interface has become messy over time and is currently the
largest. In addition to the inconsistent naming scheme, it has
multiple stat files which report more or less the same thing, a number
of debug stat files which expose internal details which shouldn't have
been part of the public interface in the first place, recursive and
non-recursive stats and leaf and non-leaf knobs.
Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
don't make any sense on the unified hierarchy as only leaf cgroups can
contain processes. cgroups is going through a major interface
revision with the unified hierarchy involving significant fundamental
usage changes and given that a significant portion of the interface
doesn't make sense anymore, it's a good time to reorganize the
interface.
As the first step, this patch renames the external visible subsystem
name from "blkio" to "io". This is more concise, matches the other
two major subsystem names, "cpu" and "memory", and better suited as
blkcg will be involved in anything writeback related too whether an
actual block device is involved or not.
As the subsystem legacy_name is set to "blkio", the only userland
visible change outside the unified hierarchy is that blkcg is reported
as "io" instead of "blkio" in the subsystem initialized message during
boot. On the unified hierarchy, blkcg now appears as "io".
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:28 +0000 (14:55 -0700)]
blkcg: refine error codes returned during blkcg configuration
blkcg currently returns -EINVAL for most errors which can be pretty
confusing given that the failure modes are quite varied. Update the
error returns so that
* -EINVAL only for syntactic errors.
* -ERANGE if the value is out of range.
* -ENODEV if the target device can't be found.
* -EOPNOTSUPP if the policy is not enabled on the target device.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:27 +0000 (14:55 -0700)]
blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
blkg_to_cfqg() and blkcg_to_cfqgd() on a valid blkg with the policy
enabled are guaranteed to return non-NULL and the counterpart in
blk-throttle doesn't have these checks either. Remove the spurious
NULL checks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:26 +0000 (14:55 -0700)]
blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
The recent percpu conversion of blkg_rwstat triggered the following
warning in certain configurations.
block/blk-cgroup.c:654:1: warning: the frame size of 1360 bytes is larger than 1024 bytes
This is because blkg_rwstat now contains four percpu_counter which can
be pretty big depending on debug options although it shouldn't be a
problem in production configs. This patch removes one of the two
local blkg_rwstat variables used by blkg_rwstat_recursive_sum() to
reduce stack usage.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Link: http://article.gmane.org/gmane.linux.kernel.cgroups/13835
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:25 +0000 (14:55 -0700)]
blkcg: remove cfqg_stats->sectors
cfq_stats->sectors is a blkg_stat which keeps track of the total
number of sectors serviced; however, this can be trivially calculated
from blkcg_gq->stat_bytes. The only thing necessary is adding up
READs and WRITEs and then dividing by sector size.
Remove cfqg_stats->sectors and make cfq print "sectors" and
"sectors_recursive" from stat_bytes.
While this is a bit more code, it removes duplicate stat allocations
and updates and ensures that the reported stats stay in tune with each
other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:24 +0000 (14:55 -0700)]
blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
Currently, both cfq-iosched and blk-throttle keep track of
io_service_bytes and io_serviced stats. While keeping track of them
separately may be useful during development, it doesn't make much
sense otherwise. Also, blk-throttle was counting bio's as IOs while
cfq-iosched request's, which is more confusing than informative.
This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
removes the counterparts from cfq-iosched and blk-throttle and let
them print from the common blkg counters. The common counters are
incremented during bio issue in blkcg_bio_issue_check().
The outputs are still filtered by whether the policy has
blkg_policy_data on a given blkg, so cfq's output won't show up if it
has never been used for a given blkg. The only times when the outputs
would differ significantly are when policies are attached on the fly
or elevators are switched back and forth. Those are quite exceptional
operations and I don't think they warrant keeping separate counters.
v3: Update blkio-controller.txt accordingly.
v2: Account IOs during bio issues instead of request completions so
that bio-based drivers can be handled the same way.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:23 +0000 (14:55 -0700)]
blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
Currently, blkg_[rw]stat_recursive_sum() assume that the target
counter is located in pd (blkg_policy_data); however, some counters
are planned to be moved to blkg (blkcg_gq).
This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
blkg_policy pointers instead of pd. If policy is NULL, it indexes
into blkg. If non-NULL, into the blkg's pd of the policy.
The existing usages are updated to maintain the current behaviors.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:22 +0000 (14:55 -0700)]
blkcg: make blkcg_[rw]stat per-cpu
blkcg_[rw]stat are used as stat counters for blkcg policies. It isn't
per-cpu by itself and blk-throttle makes it per-cpu by wrapping around
it. This patch makes blkcg_[rw]stat per-cpu and drop the ad-hoc
per-cpu wrapping in blk-throttle.
* blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
percpu_counter. This makes syncp unnecessary as remote accesses are
handled by percpu_counter itself.
* blkg_[rw]stat_init() can now fail due to percpu allocation failure
and thus are updated to return int.
* percpu_counters need explicit freeing. blkg_[rw]stat_exit() added.
* As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
and summing results are stored in ->aux_cnt[] instead.
* Custom per-cpu stat implementation in blk-throttle is removed.
This makes all blkcg stat counters per-cpu without complicating policy
implmentations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:21 +0000 (14:55 -0700)]
blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
cgroup stats are local to each cgroup and doesn't propagate to
ancestors by default. When recursive stats are necessary, the sum is
calculated over all the descendants. This initially was for backward
compatibility to support both group-local and recursive stats but this
mode of operation makes general sense as stat update is much hotter
thafn reporting those stats.
This however ends up losing recursive stats when a child is removed.
To work around this, cfq-iosched adds its stats to its parent
cfq_group->dead_stats which is summed up together when calculating
recursive stats.
It's planned that the core stats will be moved to blkcg_gq, so we want
to move the mechanism for keeping track of the stats of dead children
from cfq to blkcg core. This patch adds blkg_[rw]stat->aux_cnt which
are atomic64_t's keeping track of auxiliary counts which are excluded
when reading local counts but included for recursive.
blkg_[rw]stat_merge() which were used by cfq to implement dead_stats
are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of
a dead cgroup to the aux counts of parent->stats instead of separate
->dead_stats.
This will also help making blkg_[rw]stats per-cpu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:20 +0000 (14:55 -0700)]
blkcg: consolidate blkg creation in blkcg_bio_issue_check()
blkg (blkcg_gq) currently is created by blkcg policies invoking
blkg_lookup_create() which ends up repeating about the same code in
different policies. Theoretically, this can avoid the overhead of
looking and/or creating blkg's if blkcg is enabled but no policy is in
use; however, the cost of blkg lookup / creation is very low
especially if only the root blkcg is in use which is highly likely if
no blkcg policy is in active use - it boils down to a single very
predictable conditional and surrounding RCU protection.
This patch consolidates blkg creation to a new function
blkcg_bio_issue_check() which is called during bio issue from
generic_make_request_checks(). blkcg_bio_issue_check() is now the
only function which tries to create missing blkg's. The subsequent
policy and request_list operations just perform blkg_lookup() and if
missing falls back to the root.
* blk_get_rl() no longer tries to create blkg. It uses blkg_lookup()
instead of blkg_lookup_create().
* blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
read locked and blkg already looked up. Both throtl_lookup_tg() and
throtl_lookup_create_tg() are dropped.
* cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with
cfq_lookup_cfqg()which uses blkg_lookup().
This consolidates blkg handling and avoids unnecessary blkg creation
retries under memory pressure. In addition, this provides a common
bio entry point into blkcg where things like common accounting can be
performed.
v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
!CONFIG_BLK_DEV_THROTTLING.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:19 +0000 (14:55 -0700)]
blk-throttle: improve queue bypass handling
If a queue is bypassing, all blkcg policies should become noops but
blk-throttle wasn't. It only became noop if the queue was dying.
While this wouldn't lead to an oops as falling back to the root blkg
is safe in this case, this can be a bit surprising - a bypassing queue
could still be applying throttle limits.
Fix it by removing blk_queue_dying() test in throtl_lookup_create_tg()
and testing blk_queue_bypass() in blk_throtl_bio() and bypassing
before doing anything else.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:18 +0000 (14:55 -0700)]
blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
Currently, both throttle and cfq policies implement their own root
blkg (blkcg_gq) lookup fast path. This patch moves root blkg
optimization from throtl_lookup_tg() to __blkg_lookup(). cfq-iosched
currently doesn't use blkg_lookup() but will be converted and drop the
optimization too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:17 +0000 (14:55 -0700)]
blkcg: inline [__]blkg_lookup()
blkg_lookup() checks whether the target queue is bypassing and, if
not, calls __blkg_lookup() which first checks the lookup hint and then
performs radix tree walk. The operations upto hint checking are
trivial and there are many users of this function. This patch inlines
blkg_lookup() and the fast path part of __blkg_lookup(). The radix
tree lookup and hint update are now in blkg_lookup_slowpath().
This will help consolidating blkg handling by easing moving root blkcg
short-circuit to inlined lookup fast path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:16 +0000 (14:55 -0700)]
blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods
Each active policy has a cpd (blkcg_policy_data) on each blkcg. The
cpd's were allocated by blkcg core and each policy could request to
allocate extra space at the end by setting blkcg_policy->cpd_size
larger than the size of cpd.
This is a bit unusual but blkg (blkcg_gq) policy data used to be
handled this way too so it made sense to be consistent; however, blkg
policy data switched to alloc/free callbacks.
This patch makes similar changes to cpd handling.
blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size. As
cpd allocation is now done from policy side, it can simply allocate a
larger area which embeds cpd at the beginning.
As ->cpd_alloc_fn() may be able to perform all necessary
initializations, this patch makes ->cpd_init_fn() optional.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:15 +0000 (14:55 -0700)]
blkcg: minor updates around blkcg_policy_data
* Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used
for blkcg_policy_data.
* Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of
blkcg. This makes it consistent with blkg_policy_data methods and
to-be-added cpd alloc/free methods.
* blkcg_policy_data->blkcg and cpd_to_blkcg() added so that
cpd_init_fn() can determine the associated blkcg from
blkcg_policy_data.
v2: blkcg_policy_data->blkcg initializations were missing. Added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:14 +0000 (14:55 -0700)]
blkcg: make blkcg_policy methods take a pointer to blkcg_policy_data
The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
(blkg_policy_data) while the older ones use blkg (blkcg_gq). As using
blkg doesn't make sense for ->pd_alloc_fn() and after allocation pd
can always be mapped to blkg and given that these are policy-specific
methods, it makes sense to converge on pd.
This patch makes all methods deal with pd instead of blkg. Most
conversions are trivial. In blk-cgroup.c, a couple method invocation
sites now test whether pd exists instead of policy state for
consistency. This shouldn't cause any behavioral differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:13 +0000 (14:55 -0700)]
blk-throttle: clean up blkg_policy_data alloc/init/exit/free methods
With the recent addition of alloc and free methods, things became
messier. This patch reorganizes them according to the followings.
* ->pd_alloc_fn()
Responsible for allocation and static initializations - the ones
which can be done independent of where the pd might be attached.
* ->pd_init_fn()
Initializations which require the knowledge of where the pd is
attached.
* ->pd_free_fn()
The counter part of pd_alloc_fn(). Static de-init and freeing.
This leaves ->pd_exit_fn() without any users. Removed.
While at it, collapse an one liner function throtl_pd_exit(), which
has only one user, into its user.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:12 +0000 (14:55 -0700)]
blk-throttle: remove asynchrnous percpu stats allocation mechanism
Because percpu allocator couldn't do non-blocking allocations,
blk-throttle was forced to implement an ad-hoc asynchronous allocation
mechanism for its percpu stats for cases where blkg's (blkcg_gq's) are
allocated from an IO path without sleepable context.
Now that percpu allocator can handle gfp_mask and blkg_policy_data
alloc / free are handled by policy methods, the ad-hoc asynchronous
allocation mechanism can be replaced with direct allocation from
tg_stats_alloc_fn(). Rit it out.
This ensures that an active throtl_grp always has valid non-NULL
->stats_cpu. Remove checks on it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:11 +0000 (14:55 -0700)]
blkcg: replace blkcg_policy->pd_size with ->pd_alloc/free_fn() methods
A blkg (blkcg_gq) represents the relationship between a cgroup and
request_queue. Each active policy has a pd (blkg_policy_data) on each
blkg. The pd's were allocated by blkcg core and each policy could
request to allocate extra space at the end by setting
blkcg_policy->pd_size larger than the size of pd.
This is a bit unusual but was done this way mostly to simplify error
handling and all the existing use cases could be handled this way;
however, this is becoming too restrictive now that percpu memory can
be allocated without blocking.
This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
and pd_free_fn() - which are used to allocate and release pd for a
given policy. As pd allocation is now done from policy side, it can
simply allocate a larger area which embeds pd at the beginning. This
change makes ->pd_size pointless. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:10 +0000 (14:55 -0700)]
blkcg: make blkcg_activate_policy() allow NULL ->pd_init_fn
blkg_create() allows NULL ->pd_init_fn() but blkcg_activate_policy()
doesn't. As both in-kernel policies implement ->pd_init_fn, it
currently doesn't break anything. Update blkcg_activate_policy() so
that its behavior is consistent with blkg_create().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:09 +0000 (14:55 -0700)]
blkcg: restructure blkg_policy_data allocation in blkcg_activate_policy()
When a policy gets activated, it needs to allocate and install its
policy data on all existing blkg's (blkcg_gq's). Because blkg
iteration is protected by a spinlock, it currently counts the total
number of blkg's in the system, allocates the matching number of
policy data on a list and installs them during a single iteration.
This can be simplified by using speculative GFP_NOWAIT allocations
while iterating and falling back to a preallocated policy data on
failure. If the preallocated one has already been consumed, it
releases the lock, preallocate with GFP_KERNEL and then restarts the
iteration. This can be a bit more expensive than before but policy
activation is a very cold path and shouldn't matter.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:08 +0000 (14:55 -0700)]
blkcg: remove unnecessary blkcg_root handling from css_alloc/free paths
blkcg_css_alloc() bypasses policy data allocation and blkcg_css_free()
bypasses policy data and blkcg freeing for blkcg_root. There's no
reason to to treat policy data any differently for blkcg_root. If the
root css gets allocated after policies are registered, policy
registration path will add policy data; otherwise, the alloc path
will. The free path isn't never invoked for root csses.
This patch removes the unnecessary special handling of blkcg_root from
css_alloc/free paths.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:07 +0000 (14:55 -0700)]
blkcg: use blkg_free() in blkcg_init_queue() failure path
When blkcg_init_queue() fails midway after creating a new blkg, it
performs kfree() directly; however, this doesn't free the policy data
areas. Make it use blkg_free() instead. In turn, blkg_free() is
updated to handle root request_list special case.
While this fixes a possible memory leak, it's on an unlikely failure
path of an already cold path and the size leaked per occurrence is
miniscule too. I don't think it needs to be tagged for -stable.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:06 +0000 (14:55 -0700)]
blkcg: remove unnecessary request_list->blkg NULL test in blk_put_rl()
Since
ec13b1d6f0a0 ("blkcg: always create the blkcg_gq for the root
blkcg"), a request_list always has its blkg associated. Drop
unnecessary rl->blkg NULL test from blk_put_rl().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:05 +0000 (14:55 -0700)]
cfq-iosched: charge async IOs to the appropriate blkcg's instead of the root
Up until now, all async IOs were queued to async queues which are
shared across the whole request_queue, which means that blkcg resource
control is completely void on async IOs including all writeback IOs.
It was done this way because writeback didn't support writeback and
there was no way of telling which writeback IO belonged to which
cgroup; however, writeback recently became cgroup aware and writeback
bio's are sent down properly tagged with the blkcg's to charge them
against.
This patch makes async cfq_queues per-cfq_cgroup instead of
per-cfq_data so that each async IO is charged to the blkcg that it was
tagged for instead of unconditionally attributing it to root.
* cfq_data->async_cfqq and ->async_idle_cfqq are moved to cfq_group
and alloc / destroy paths are updated accordingly.
* cfq_link_cfqq_cfqg() no longer overrides @cfqg to root for async
queues.
* check_blkcg_changed() now also invalidates async queues as they no
longer stay the same across cgroups.
After this patch, cfq's proportional IO control through blkio.weight
works correctly when cgroup writeback is in use.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:04 +0000 (14:55 -0700)]
cfq-iosched: fold cfq_find_alloc_queue() into cfq_get_queue()
cfq_find_alloc_queue() checks whether a queue actually needs to be
allocated, which is unnecessary as its sole caller, cfq_get_queue(),
only calls it if so. Also, the oom queue fallback logic is scattered
between cfq_get_queue() and cfq_find_alloc_queue(). There really
isn't much going on in the latter and things can be made simpler by
folding it into cfq_get_queue().
This patch collapses cfq_find_alloc_queue() into cfq_get_queue(). The
change is fairly straight-forward with one exception - async_cfqq is
now initialized to NULL and the "!is_sync" test in the last if
conditional is replaced with "async_cfqq" test. This is because gcc
(5.1.1) gets confused for some reason and warns that async_cfqq may be
used uninitialized otherwise. Oh well, the code isn't necessarily
worse this way.
This patch doesn't cause any functional difference.
v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:03 +0000 (14:55 -0700)]
cfq-iosched: move cfq_group determination from cfq_find_alloc_queue() to cfq_get_queue()
This is necessary for making async cfq_cgroups per-cfq_group instead
of per-cfq_data. While this change makes cfq_get_queue() perform RCU
locking and look up cfq_group even when it reuses async queue, the
extra overhead is extremely unlikely to be noticeable given that this
is already sitting behind cic->cfqq[] cache and the overall cost of
cfq operation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:02 +0000 (14:55 -0700)]
cfq-iosched: remove @gfp_mask from cfq_find_alloc_queue()
Even when allocations fail, cfq_find_alloc_queue() always returns a
valid cfq_queue by falling back to the oom cfq_queue. As such, there
isn't much point in taking @gfp_mask and trying "harder" if __GFP_WAIT
is set. GFP_NOWAIT allocations don't fail often and even when they do
the degraded behavior is acceptable and temporary.
After all, the only reason get_request(), which ultimately determines
the gfp_mask, cares about __GFP_WAIT is to guarantee request
allocation, assuming IO forward progress, for callers which are
willing to wait. There's no reason for cfq_find_alloc_queue() to
behave differently on __GFP_WAIT when it already has a fallback
mechanism.
Remove @gfp_mask from cfq_find_alloc_queue() and propagate the changes
to its callers. This simplifies the function quite a bit and will
help making async queues per-cfq_group.
v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:01 +0000 (14:55 -0700)]
blkcg, cfq-iosched: use GFP_NOWAIT instead of GFP_ATOMIC for non-critical allocations
blkcg performs several allocations to track IOs per cgroup and enforce
resource control. Most of these allocations are performed lazily on
demand in the IO path and thus can't involve reclaim path. Currently,
these allocations use GFP_ATOMIC; however, blkcg can gracefully deal
with occassional failures of these allocations by punting IOs to the
root cgroup and there's no reason to reach into the emergency reserve.
This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following
allocations.
* bdi_writeback_congested and blkcg_gq allocations in blkg_create().
* radix tree node allocations for blkcg->blkg_tree.
* cfq_queue allocation on ioprio changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:55:00 +0000 (14:55 -0700)]
cfq-iosched: minor cleanups
* Some were accessing cic->cfqq[] directly. Always use cic_to_cfqq()
and cic_set_cfqq().
* check_ioprio_changed() doesn't need to verify cfq_get_queue()'s
return for NULL. It's always non-NULL. Simplify accordingly.
This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:54:59 +0000 (14:54 -0700)]
cfq-iosched: fix oom cfq_queue ref leak in cfq_set_request()
If the cfq_queue cached in cfq_io_cq is the oom one, cfq_set_request()
replaces it by invoking cfq_get_queue() again without putting the oom
queue leaking the reference it was holding. While oom queues are not
released through reference counting, they're still reference counted
and this can theoretically lead to the reference count overflowing and
incorrectly invoke the usual release path on it.
Fix it by making cfq_set_request() put the ref it was holding.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:54:58 +0000 (14:54 -0700)]
cfq-iosched: fix async oom queue handling
Async cfqq's (cfq_queue's) are shared across cfq_data. When
cfq_get_queue() obtains a new queue from cfq_find_alloc_queue(), it
stashes the pointer in cfq_data and reuses it from then on; however,
the function doesn't consider that cfq_find_alloc_queue() may return
the oom_cfqq under memory pressure and installs the returned queue
unconditionally.
If the oom_cfqq is installed as an async cfqq, cfq_set_request() will
continue calling cfq_get_queue() hoping to replace it with a proper
queue; however, cfq_get_queue() will keep returning the cached queue
for the slot - the oom_cfqq.
Fix it by skipping caching if the queue is the oom one.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:54:57 +0000 (14:54 -0700)]
cfq-iosched: simplify control flow in cfq_get_queue()
cfq_get_queue()'s control flow looks like the following.
async_cfqq = NULL;
cfqq = NULL;
if (!is_sync) {
...
async_cfqq = ...;
cfqq = *async_cfqq;
}
if (!cfqq)
cfqq = ...;
if (!is_sync && !(*async_cfqq))
...;
The only thing the local variable init, the second if, and the
async_cfqq test in the third if achieves is to skip cfqq creation and
installation if *async_cfqq was already non-NULL. This is needlessly
complicated with different tests examining the same condition.
Simplify it to the following.
if (!is_sync) {
...
async_cfqq = ...;
cfqq = *async_cfqq;
if (cfqq)
goto out;
}
cfqq = ...;
if (!is_sync)
...;
out:
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:54:56 +0000 (14:54 -0700)]
writeback: update writeback tracepoints to report cgroup
The following tracepoints are updated to report the cgroup used during
cgroup writeback.
* writeback_write_inode[_start]
* writeback_queue
* writeback_exec
* writeback_start
* writeback_written
* writeback_wait
* writeback_nowork
* writeback_wake_background
* wbc_writepage
* writeback_queue_io
* bdi_dirty_ratelimit
* balance_dirty_pages
* writeback_sb_inodes_requeue
* writeback_single_inode[_start]
Note that writeback_bdi_register is separated out from writeback_class
as reporting cgroup doesn't make sense to it. Tracepoints which take
bdi are updated to take bdi_writeback instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:54:55 +0000 (14:54 -0700)]
kernfs: implement kernfs_path_len()
Add a function to determine the path length of a kernfs node. This
for now will be used by writeback tracepoint updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:54:54 +0000 (14:54 -0700)]
writeback: explain why @inode is allowed to be NULL for inode_congested()
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:54:53 +0000 (14:54 -0700)]
writeback: remove wb_writeback_work->single_wait/done
wb_writeback_work->single_wait/done are used for the wait mechanism
for synchronous wb_work (wb_writeback_work) items which are issued
when bdi_split_work_to_wbs() fails to allocate memory for asynchronous
wb_work items; however, there's no reason to use a separate wait
mechanism for this. bdi_split_work_to_wbs() can simply use on-stack
fallback wb_work item and separate wb_completion to wait for it.
This patch removes wb_work->single_wait/done and the related code and
make bdi_split_work_to_wbs() use on-stack fallback wb_work and
wb_completion instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Tejun Heo [Tue, 18 Aug 2015 21:54:52 +0000 (14:54 -0700)]
writeback: bdi_for_each_wb() iteration is memcg ID based not blkcg
wb's (bdi_writeback's) are currently keyed by memcg ID; however, in an
earlier implementation, wb's were keyed by blkcg ID.
bdi_for_each_wb() walks bdi->cgwb_tree in the ascending ID order and
allows iterations to start from an arbitrary ID which is used to
interrupt and resume iterations.
Unfortunately, while changing wb to be keyed by memcg ID instead of
blkcg, bdi_for_each_wb() was missed and is still assuming that wb's
are keyed by blkcg ID. This doesn't affect iterations which don't get
interrupted but bdi_split_work_to_wbs() makes use of iteration
resuming on allocation failures and thus may incorrectly skip or
repeat wb's.
Fix it by changing bdi_for_each_wb() to take memcg IDs instead of
blkcg IDs and updating bdi_split_work_to_wbs() accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 18 Aug 2015 22:49:11 +0000 (15:49 -0700)]
Merge branch 'for-4.3-unified-base' of git://git./linux/kernel/git/tj/cgroup into for-4.3/blkcg
Tejun Heo [Tue, 18 Aug 2015 20:58:16 +0000 (13:58 -0700)]
cgroup: introduce cgroup_subsys->legacy_name
This allows cgroup subsystems to use a different name on the unified
hierarchy. cgroup_subsys->name is used on the unified hierarchy,
->legacy_name elsewhere. If ->legacy_name is not explicitly set, it's
automatically set to ->name and the userland visible behavior remains
unchanged.
v2: Make parse_cgroupfs_options() only consider ->legacy_name as mount
options are used only on legacy hierarchies. Suggested by Li
Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Tejun Heo [Tue, 18 Aug 2015 20:58:16 +0000 (13:58 -0700)]
cgroup: don't print subsystems for the default hierarchy
It doesn't make sense to print subsystems on mount option or
/proc/PID/cgroup for the default hierarchy.
* cgroup.controllers file at the root of the default hierarchy lists
the currently attached controllers.
* The default hierarchy is catch-all for unmounted subsystems.
* The default hierarchy doesn't accept any mount options.
Suppress subsystem printing on mount options and /proc/PID/cgroup for
the default hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Tejun Heo [Tue, 11 Aug 2015 17:35:42 +0000 (13:35 -0400)]
cgroup: make cftype->private a unsigned long
It's pretty unusual to have an int as a private data field and it
makes it impossible to carray a pointer value through it. Let's make
it an unsigned long. AFAICS, this shouldn't break anything.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Tejun Heo [Wed, 5 Aug 2015 20:03:19 +0000 (16:03 -0400)]
cgroup: export cgrp_dfl_root
While cgroup subsystems can't be modules, blkcg supports dynamically
loadable policies which interact with cgroup core. Export
cgrp_dfl_root so that cgroup_on_dfl() can be used in those modules.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Tejun Heo [Tue, 4 Aug 2015 19:20:55 +0000 (15:20 -0400)]
cgroup: define controller file conventions
Traditionally, each cgroup controller implemented whatever interface
it wanted leading to interfaces which are widely inconsistent.
Examining the requirements of the controllers readily yield that there
are only a few control schemes shared among all.
Two major controllers already had to implement new interface for the
unified hierarchy due to significant structural changes. Let's take
the chance to establish common conventions throughout all controllers.
This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight
based control knobs and documents the conventions that controllers
should follow on the unified hierarchy. Except for io.weight knob,
all existing unified hierarchy knobs are already compliant. A
follow-up patch will update io.weight.
v2: Added descriptions of min, low and high knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Jens Axboe [Mon, 27 Jul 2015 17:58:41 +0000 (11:58 -0600)]
Merge branch 'stable/for-jens-4.2' of git://git./linux/kernel/git/konrad/xen into for-linus
Konrad writes:
"There are three bugs that have been found in the xen-blkfront (and
backend). Two of them have the stable tree CC-ed. They have been found
where an guest is migrating to a host that is missing
'feature-persistent' support (from one that has it enabled). We end up
hitting an BUG() in the driver code."
Linus Torvalds [Sun, 26 Jul 2015 19:26:21 +0000 (12:26 -0700)]
Linux 4.2-rc4
Linus Torvalds [Sun, 26 Jul 2015 18:46:32 +0000 (11:46 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull perf fix from Thomas Gleixner:
"A single fix for the intel cqm perf facility to prevent IPIs from
interrupt context"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/cqm: Return cached counter value from IRQ context
Linus Torvalds [Sun, 26 Jul 2015 18:14:04 +0000 (11:14 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"This update contains:
- the manual revert of the SYSCALL32 changes which caused a
regression
- a fix for the MPX vma handling
- three fixes for the ioremap 'is ram' checks.
- PAT warning fixes
- a trivial fix for the size calculation of TLB tracepoints
- handle old EFI structures gracefully
This also contains a PAT fix from Jan plus a revert thereof. Toshi
explained why the code is correct"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/pat: Revert 'Adjust default caching mode translation tables'
x86/asm/entry/32: Revert 'Do not use R9 in SYSCALL32' commit
x86/mm: Fix newly introduced printk format warnings
mm: Fix bugs in region_is_ram()
x86/mm: Remove region_is_ram() call from ioremap
x86/mm: Move warning from __ioremap_check_ram() to the call site
x86/mm/pat, drivers/media/ivtv: Move the PAT warning and replace WARN() with pr_warn()
x86/mm/pat, drivers/infiniband/ipath: Replace WARN() with pr_warn()
x86/mm/pat: Adjust default caching mode translation tables
x86/fpu: Disable dependent CPU features on "noxsave"
x86/mpx: Do not set ->vm_ops on MPX VMAs
x86/mm: Add parenthesis for TLB tracepoint size calculation
efi: Handle memory error structures produced based on old versions of standard
Thomas Gleixner [Sun, 26 Jul 2015 08:27:37 +0000 (10:27 +0200)]
x86/mm/pat: Revert 'Adjust default caching mode translation tables'
Toshi explains:
"No, the default values need to be set to the fallback types,
i.e. minimal supported mode. For WC and WT, UC is the fallback type.
When PAT is disabled, pat_init() does update the tables below to
enable WT per the default BIOS setup. However, when PAT is enabled,
but CPU has PAT -errata, WT falls back to UC per the default values."
Revert:
ca1fec58bc6a 'x86/mm/pat: Adjust default caching mode translation tables'
Requested-by: Toshi Kani <toshi.kani@hp.com>
Cc: Jan Beulich <jbeulich@suse.de>
Link: http://lkml.kernel.org/r/1437577776.3214.252.camel@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Matt Fleming [Tue, 21 Jul 2015 14:55:09 +0000 (15:55 +0100)]
perf/x86/intel/cqm: Return cached counter value from IRQ context
Peter reported the following potential crash which I was able to
reproduce with his test program,
[ 148.765788] ------------[ cut here ]------------
[ 148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260()
[ 148.765797] Modules linked in:
[ 148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4
[ 148.765803]
ffffffff81cdc398 ffff88085f105950 ffffffff818bdfd5 0000000000000007
[ 148.765805]
0000000000000000 ffff88085f105990 ffffffff810e413a 0000000000000000
[ 148.765807]
ffffffff82301080 0000000000000022 ffffffff8107f640 ffffffff8107f640
[ 148.765809] Call Trace:
[ 148.765810] <NMI> [<
ffffffff818bdfd5>] dump_stack+0x45/0x57
[ 148.765818] [<
ffffffff810e413a>] warn_slowpath_common+0x8a/0xc0
[ 148.765822] [<
ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[ 148.765824] [<
ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[ 148.765825] [<
ffffffff810e422a>] warn_slowpath_null+0x1a/0x20
[ 148.765827] [<
ffffffff811613f6>] smp_call_function_many+0xb6/0x260
[ 148.765829] [<
ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[ 148.765831] [<
ffffffff81161748>] on_each_cpu_mask+0x28/0x60
[ 148.765832] [<
ffffffff8107f6ef>] intel_cqm_event_count+0x7f/0xe0
[ 148.765836] [<
ffffffff811cdd35>] perf_output_read+0x2a5/0x400
[ 148.765839] [<
ffffffff811d2e5a>] perf_output_sample+0x31a/0x590
[ 148.765840] [<
ffffffff811d333d>] ? perf_prepare_sample+0x26d/0x380
[ 148.765841] [<
ffffffff811d3497>] perf_event_output+0x47/0x60
[ 148.765843] [<
ffffffff811d36c5>] __perf_event_overflow+0x215/0x240
[ 148.765844] [<
ffffffff811d4124>] perf_event_overflow+0x14/0x20
[ 148.765847] [<
ffffffff8107e7f4>] intel_pmu_handle_irq+0x1d4/0x440
[ 148.765849] [<
ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[ 148.765853] [<
ffffffff81219bad>] ? vunmap_page_range+0x19d/0x2f0
[ 148.765854] [<
ffffffff81219d11>] ? unmap_kernel_range_noflush+0x11/0x20
[ 148.765859] [<
ffffffff814ce6fe>] ? ghes_copy_tofrom_phys+0x11e/0x2a0
[ 148.765863] [<
ffffffff8109e5db>] ? native_apic_msr_write+0x2b/0x30
[ 148.765865] [<
ffffffff8109e44d>] ? x2apic_send_IPI_self+0x1d/0x20
[ 148.765869] [<
ffffffff81065135>] ? arch_irq_work_raise+0x35/0x40
[ 148.765872] [<
ffffffff811c8d86>] ? irq_work_queue+0x66/0x80
[ 148.765875] [<
ffffffff81075306>] perf_event_nmi_handler+0x26/0x40
[ 148.765877] [<
ffffffff81063ed9>] nmi_handle+0x79/0x100
[ 148.765879] [<
ffffffff81064422>] default_do_nmi+0x42/0x100
[ 148.765880] [<
ffffffff81064563>] do_nmi+0x83/0xb0
[ 148.765884] [<
ffffffff818c7c0f>] end_repeat_nmi+0x1e/0x2e
[ 148.765886] [<
ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[ 148.765888] [<
ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[ 148.765890] [<
ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[ 148.765891] <<EOE>> [<
ffffffff8110ab66>] finish_task_switch+0x156/0x210
[ 148.765898] [<
ffffffff818c1671>] __schedule+0x341/0x920
[ 148.765899] [<
ffffffff818c1c87>] schedule+0x37/0x80
[ 148.765903] [<
ffffffff810ae1af>] ? do_page_fault+0x2f/0x80
[ 148.765905] [<
ffffffff818c1f4a>] schedule_user+0x1a/0x50
[ 148.765907] [<
ffffffff818c666c>] retint_careful+0x14/0x32
[ 148.765908] ---[ end trace
e33ff2be78e14901 ]---
The CQM task events are not safe to be called from within interrupt
context because they require performing an IPI to read the counter value
on all sockets. And performing IPIs from within IRQ context is a
"no-no".
Make do with the last read counter value currently event in
event->count when we're invoked in this context.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Will Auld <will.auld@intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Linus Torvalds [Sun, 26 Jul 2015 03:11:12 +0000 (20:11 -0700)]
Merge tag 'usb-4.2-rc4' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here's a few USB and PHY fixes for 4.2-rc4.
Nothing major, the shortlog has the full details.
All of these have been in linux-next successfully"
* tag 'usb-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (21 commits)
USB: OHCI: fix bad #define in ohci-tmio.c
cdc-acm: Destroy acm_minors IDR on module exit
usb-storage: Add ignore-device quirk for gm12u320 based usb mini projectors
usb-storage: ignore ZTE MF 823 card reader in mode 0x1225
USB: OHCI: Fix race between ED unlink and URB submission
usb: core: lpm: set lpm_capable for root hub device
xhci: do not report PLC when link is in internal resume state
xhci: prevent bus_suspend if SS port resuming in phase 1
xhci: report U3 when link is in resume state
xhci: Calculate old endpoints correctly on device reset
usb: xhci: Bugfix for NULL pointer deference in xhci_endpoint_init() function
xhci: Workaround to get D3 working in Intel xHCI
xhci: call BIOS workaround to enable runtime suspend on Intel Braswell
usb: dwc3: Reset the transfer resource index on SET_INTERFACE
usb: gadget: udc: core: Fix argument of dma_map_single for IOMMU
usb: gadget: mv_udc_core: fix phy_regs I/O memory leak
usb: ulpi: ulpi_init should be executed in subsys_initcall
phy: berlin-usb: fix divider for BG2
phy: berlin-usb: fix divider for BG2CD
phy/pxa: add HAS_IOMEM dependency
...
Linus Torvalds [Sun, 26 Jul 2015 03:05:07 +0000 (20:05 -0700)]
Merge tag 'tty-4.2-rc4' of git://git./linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are a number of small serial and tty fixes for reported issues.
All have been in linux-next successfully"
* tag 'tty-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
tty: vt: Fix !TASK_RUNNING diagnostic warning from paste_selection()
serial: core: Fix crashes while echoing when closing
m32r: Add ioreadXX/iowriteXX big-endian mmio accessors
Revert "serial: imx: initialized DMA w/o HW flow enabled"
sc16is7xx: fix FIFO address of secondary UART
sc16is7xx: fix Kconfig dependencies
serial: etraxfs-uart: Fix release etraxfs_uart_ports
tty/vt: Fix the memory leak in visual_init
serial: amba-pl011: Fix devm_ioremap_resource return value check
n_tty: signal and flush atomically
Linus Torvalds [Sun, 26 Jul 2015 03:03:10 +0000 (20:03 -0700)]
Merge tag 'staging-4.2-rc4' of git://git./linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Here are a number of iio and staging driver fixes for reported issues
for 4.2-rc4.
All have been in linux-next for a while with no problems"
* tag 'staging-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (34 commits)
iio:light:stk3310: make endianness independent of host
iio:light:stk3310: move device register to end of probe
iio: mma8452: use iio event type IIO_EV_TYPE_MAG
iio: mcp320x: Fix NULL pointer dereference
iio: adc: vf610: fix the adc register read fail issue
iio: mlx96014: Replace offset sign
iio: magnetometer: mmc35240: fix SET/RESET sequence
iio: magnetometer: mmc35240: Fix SET/RESET mask
iio: magnetometer: mmc35240: Fix crash in pm suspend
iio:magnetometer:bmc150_magn: output intended variable
iio:magnetometer:bmc150_magn: add regmap dependency
staging: vt6656: check ieee80211_bss_conf bssid not NULL
staging: vt6655: check ieee80211_bss_conf bssid not NULL
iio: tmp006: Check channel info on write
iio: sx9500: Add missing init in sx9500_buffer_pre{en,dis}able()
iio:light:ltr501: fix regmap dependency
iio:light:ltr501: fix variable in ltr501_init
iio: sx9500: fix bug in compensation code
iio: sx9500: rework error handling of raw readings
iio: magnetometer: mmc35240: fix available sampling frequencies
...
Linus Torvalds [Sun, 26 Jul 2015 02:50:59 +0000 (19:50 -0700)]
Merge tag 'char-misc-4.2-rc4' of git://git./linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some char and misc driver fixes for reported issues.
One parport patch is reverted as it was incorrect, thanks to testing
by the 0-day bot"
* tag 'char-misc-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
parport: Revert "parport: fix memory leak"
mei: prevent unloading mei hw modules while the device is opened.
misc: mic: scif bug fix for vmalloc_to_page crash
parport: fix freeing freed memory
parport: fix memory leak
parport: fix error handling
Sudip Mukherjee [Sat, 25 Jul 2015 07:49:40 +0000 (13:19 +0530)]
parport: Revert "parport: fix memory leak"
This reverts commit
23c405912b88 ("parport: fix memory leak")
par_dev->state was already being removed in parport_unregister_device().
Reported-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Linus Torvalds [Sat, 25 Jul 2015 18:42:54 +0000 (11:42 -0700)]
Merge tag 'trace-v4.2-rc2-fix3' of git://git./linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Back in 3.16 the ftrace code was redesigned and cleaned up to remove
the double iteration list (one for registered ftrace ops, and one for
registered "global" ops), to just use one list. That simplified the
code but also broke the function tracing filtering on pid.
This updates the code to handle the filtering again with the new
logic"
* tag 'trace-v4.2-rc2-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix breakage of set_ftrace_pid
Linus Torvalds [Sat, 25 Jul 2015 18:36:12 +0000 (11:36 -0700)]
Merge branch 'libnvdimm-fixes' of git://git./linux/kernel/git/djbw/nvdimm
Pull libnvdimm fix from Dan Williams:
"A minor fix for the libnvdimm subsystem.
This is not critical. The problem can be worked around in userspace
by putting the namespace temporarily into raw mode
(ndctl_namespace_set_raw_mode() from libndctl), but that is awkward
for management utilities.
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm:
libnvdimm: fix namespace seed creation
Linus Torvalds [Sat, 25 Jul 2015 18:24:58 +0000 (11:24 -0700)]
Merge tag 'md/4.2-fixes' of git://neil.brown.name/md
Pull md fixes from Neil Brown:
"Some md fixes for 4.2
Several are tagged for -stable.
A few aren't because they are not very, serious or because they are in
the 'experimental' cluster code"
* tag 'md/4.2-fixes' of git://neil.brown.name/md:
md/raid5: clear R5_NeedReplace when no longer needed.
Fix read-balancing during node failure
md-cluster: fix bitmap sub-offset in bitmap_read_sb
md: Return error if request_module fails and returns positive value
md: Skip cluster setup in case of error while reading bitmap
md/raid1: fix test for 'was read error from last working device'.
md: Skip cluster setup for dm-raid
md: flush ->event_work before stopping array.
md/raid10: always set reshape_safe when initializing reshape_position.
md/raid5: avoid races when changing cache size.
Linus Torvalds [Sat, 25 Jul 2015 18:19:38 +0000 (11:19 -0700)]
Merge tag 'for-linus-
20150724' of git://git.infradead.org/linux-mtd
Pull MTD fixes from Brian Norris:
"Two trivial updates. I meant to send these much earlier, but I've
been preoccupied.
- Add MAINTAINERS entry for diskonchip g3 driver
- Fix an overlooked conflict in bitfield value assignments
The latter update is a bit overdue, but there's no reason to wait any
longer"
* tag 'for-linus-
20150724' of git://git.infradead.org/linux-mtd:
mtd: nand: Fix NAND_USE_BOUNCE_BUFFER flag conflict
MAINTAINERS: mtd: docg3: add docg3 maintainer
Dan Williams [Sat, 25 Jul 2015 03:42:34 +0000 (23:42 -0400)]
libnvdimm: fix namespace seed creation
A new BLK namespace "seed" device is created whenever the current seed
is successfully probed. However, if that namespace is assigned to a BTT
it may never directly experience a successful probe as it is a
subordinate device to a BTT configuration.
The effect of the current code is that no new namespaces can be
instantiated, after the seed namespace, to consume available BLK DPA
capacity. Fix this by treating a successful BTT probe event as a
successful probe event for the backing namespace.
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Linus Torvalds [Sat, 25 Jul 2015 00:00:04 +0000 (17:00 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Four smaller fixes for the current series. This contains:
- A fix for clones of discard bio's, that can cause data corruption.
From Martin.
- A fix for null_blk, where in certain queue modes it could access a
request after it had been freed. From Mike Krinkin.
- An error handling leak fix for blkcg, from Tejun.
- Also from Tejun, export of the functions that a file system needs
to implement cgroup writeback support"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: Do a full clone when splitting discard bios
block: export bio_associate_*() and wbc_account_io()
blkcg: fix gendisk reference leak in blkg_conf_prep()
null_blk: fix use-after-free problem
Linus Torvalds [Fri, 24 Jul 2015 23:54:59 +0000 (16:54 -0700)]
Merge branch 'for-4.2-fixes' of git://git./linux/kernel/git/tj/libata
Pull libata fixes from Tejun Heo:
"A couple important fixes.
- A block layer change which removed restriction on max transfer size
led to silent data corruption on some devices. A new quirk is
added to restore the old size limit for the reported device. If it
gets reported on more devices, we might have to consider restoring
the restriction for ATA devices by default.
- There finally is a SSD which is confirmed to cause data corruption
on TRIM regardless of which flavor is used. A new quirk is added
and the device is blacklisted
- Other device-specific workarounds"
* 'for-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
libata: Do not blacklist M510DC
libata: increase the timeout when setting transfer mode
libata: add ATA_HORKAGE_MAX_SEC_1024 to revert back to previous max_sectors limit
libata: force disable trim for SuperSSpeed S238
libata: add ATA_HORKAGE_NOTRIM
libata: add ATA_HORKAGE_BROKEN_FPDMA_AA quirk for HP 250GB SATA disk VB0250EAVER
ata: pmp: add quirk for Marvell 4140 SATA PMP
Linus Torvalds [Fri, 24 Jul 2015 23:43:16 +0000 (16:43 -0700)]
Merge tag 'mmc-4.2-rc3' of git://git.linaro.org/people/ulf.hansson/mmc
Pull MMC fixes from Ulf Hansson:
"Here are some mmc fixes intended for v4.2 rc4.
Note, most of the changes are for the sdhci-esdhc-imx controller,
which also required us to modify some related DTS files. Those
changes have been acked by the SoC maintainer.
MMC core:
- Fix a reference inbalance issue for power_ro_lock_show() sysfs handler
MMC host:
- omap_hsmmc: Fix IRQ errorhandling for CD, DTO, and CRC
- sdhci: Prevent a kernel panic while using DMA
- mtk-sd: Let it depend on HAS_DMA to prevent build errors
- sdhci-esdhc: Make 8BIT bus work
- sdhci-esdhc-imx: Fix some regressions for DT based platforms
- sdhci-pxav3: Fix a regression for DT based platforms"
* tag 'mmc-4.2-rc3' of git://git.linaro.org/people/ulf.hansson/mmc:
mmc: sdhci-pxav3: fix platform_data is not initialized
dts: mmc: fsl-imx-esdhc: remove fsl,cd-controller support
mmc: sdhci-esdhc-imx: clear f_max in boarddata
mmc: sdhci-esdhc-imx: remove duplicated dts parsing
mmc: sdhci: make max-frequency property in device tree work
mmc: sdhci-esdhc-imx: move all non dt probe code into one function
mmc: sdhci-esdhc-imx: fix cd regression for dt platform
dts: imx7: fix sd card gpio polarity specified in device tree
dts: imx25: fix sd card gpio polarity specified in device tree
dts: imx6: fix sd card gpio polarity specified in device tree
dts: imx53: fix sd card gpio polarity specified in device tree
dts: imx51: fix sd card gpio polarity specified in device tree
mmc: sdhci-esdhc: Make 8BIT bus work
mmc: block: Add missing mmc_blk_put() in power_ro_lock_show()
mmc: MMC_MTK should depend on HAS_DMA
mmc: sdhci check parameters before call dma_free_coherent
mmc: omap_hsmmc: Handle BADA, DEB and CEB interrupts
mmc: omap_hsmmc: Fix DTO and DCRC handling
Linus Torvalds [Fri, 24 Jul 2015 23:33:11 +0000 (16:33 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
"A fix for the warnings/oops when handling HID devices with "unnamed"
LEDs and couple of other driver fixups""
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: goodix - fix touch coordinates on WinBook TW100 and TW700
Input: LEDs - skip unnamed LEDs
Input: usbtouchscreen - avoid unresponsive TSC-30 touch screen
Input: elantech - force resolution of 31 u/mm
Input: zforce - don't overwrite the stack
Linus Torvalds [Fri, 24 Jul 2015 20:14:06 +0000 (13:14 -0700)]
Merge tag 'regulator-fix-v4.2-rc3' of git://git./linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"As well as some driver specific fixes there's several fixes here for
the core support for regulators supplying other regulators fixing both
an issue with ACPI support (which had never been tested before) and
some error handling and device removal issues that Javier noticed"
* tag 'regulator-fix-v4.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: core: Fix memory leak in regulator_resolve_supply()
regulator: core: Increase refcount for regulator supply's module
regulator: core: Handle full constraints systems when resolving supplies
regulator: 88pm800: fix LDO vsel_mask value
regulator: max8973: Fix up control flag option for bias control
regulator: s2mps11: Fix GPIO suspend enable shift wrapping bug
Linus Torvalds [Fri, 24 Jul 2015 20:13:28 +0000 (13:13 -0700)]
Merge tag 'spi-fix-v4.2-rc3' of git://git./linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of pretty much unremarkable driver specific fixes
here plus the addition of a new device ID to spidev which requires no
other code changes"
* tag 'spi-fix-v4.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: imx: Fix small DMA transfers
spi: zynq: missing break statement
spi: SPI_ZYNQMP_GQSPI should depend on HAS_DMA
spi: spidev: add compatible value for LTC2488
spi: img-spfi: fix support for speeds up to 1/4th input clock
Linus Torvalds [Fri, 24 Jul 2015 19:50:19 +0000 (12:50 -0700)]
Merge tag 'sound-4.2-rc4' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This has been a calm week again: one minor lockdep fix for PCM core,
and the most of the rest are HD-audio quirks and fixups for various
chips and machines"
* tag 'sound-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - Add headset mic pin quirk for a Dell device
ALSA: hda - remove one pin from ALC292_STANDARD_PINS
ALSA: hda - Add new GPU codec ID 0x10de007d to snd-hda
ALSA: hda: add new AMD PCI IDs with proper driver caps
ALSA: hda - Fix Skylake codec timeout
ALSA: hda - Add headset mic support for Acer Aspire V5-573G
ALSA: sparc: Add missing kfree in error path
ALSA: pcm: Fix lockdep warning with nonatomic PCM ops
Linus Torvalds [Fri, 24 Jul 2015 19:44:24 +0000 (12:44 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid
Pull HID fixes from Jiri Kosina:
- kernel crash fixes for multitouch and wacom drivers, by Brent Adam
and Dan Carpenter
- cp2112 data packet race condition corruption fix, by Antonio Borneo
- a few new device IDs for wacom and microsoft drivers
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: cp2112: fix to force single data-report reply
HID: wacom: Enable pad device for older Bamboo Touch tablets
HID: multitouch: Fix fields from pen report ID being interpreted for multitouch
HID: microsoft: Add quirk for MS Surface Type/Touch cover
HID: wacom: NULL dereferences on error in probe()
Linus Torvalds [Fri, 24 Jul 2015 19:37:29 +0000 (12:37 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Aome amdgpu, one i915, one ttm and one hlcdc, nothing too scary.
All seems fine for about this time"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/ttm: recognize ARM64 arch in ioprot handler
drm/amdgpu/cz/dpm: properly report UVD and VCE clock levels
drm/amdgpu/cz: implement voltage validation properly
drm/amdgpu: add VCE harvesting instance query
drm/amdgpu: implement VCE 3.0 harvesting support (v4)
drm/amdgpu/dce10: Re-set VBLANK interrupt state when enabling a CRTC
drm/amdgpu/dce11: Re-set VBLANK interrupt state when enabling a CRTC
drm: Stop resetting connector state to unknown
drm/i915: Use two 32bit reads for select 64bit REG_READ ioctls
drm: atmel-hlcdc: fix vblank initial state
Linus Torvalds [Fri, 24 Jul 2015 19:31:30 +0000 (12:31 -0700)]
Merge branch 'stable' of git://git./linux/kernel/git/cmetcalf/linux-tile
Pull arch/tile bugfix from Chris Metcalf:
"This fixes a bug in freeing the initramfs memory"
* 'stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
tile: use free_bootmem_late() for initrd
Linus Torvalds [Fri, 24 Jul 2015 19:23:06 +0000 (12:23 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Everything related to the new quirks and memory type features:
- small improvements to the quirks API
- extending one of the quirks from just AMD to Intel as well, because
4.2 can show the same problem with problematic firmware on Intel
too"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: rename quirk constants to KVM_X86_QUIRK_*
KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED
KVM: x86: introduce kvm_check_has_quirk
KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type
KVM: MTRR: fix memory type handling if MTRR is completely disabled
Steven Rostedt (Red Hat) [Fri, 24 Jul 2015 14:38:12 +0000 (10:38 -0400)]
ftrace: Fix breakage of set_ftrace_pid
Commit
4104d326b670 ("ftrace: Remove global function list and call function
directly") simplified the ftrace code by removing the global_ops list with a
new design. But this cleanup also broke the filtering of PIDs that are added
to the set_ftrace_pid file.
Add back the proper hooks to have pid filtering working once again.
Cc: stable@vger.kernel.org # 3.16+
Reported-by: Matt Fleming <matt@console-pimps.org>
Reported-by: Richard Weinberger <richard.weinberger@gmail.com>
Tested-by: Matt Fleming <matt@console-pimps.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Bastien Nocera [Fri, 24 Jul 2015 16:08:53 +0000 (09:08 -0700)]
Input: goodix - fix touch coordinates on WinBook TW100 and TW700
The touchscreen on the WinBook TW100 and TW700 don't match the default
display, with 0,0 touches being reported when touching at the bottom
right of the screen.
1280,800 0,800
+-------------+
| |
| |
| |
+-------------+
1280,0 0,0
It's unfortunately impossible to detect this problem with data from the
DSDT, or other auxiliary metadata, so fallback to quirking this specific
model of tablet instead.
Signed-off-by: Bastien Nocera <hadess@hadess.net>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Dmitry Torokhov [Wed, 22 Jul 2015 21:56:39 +0000 (14:56 -0700)]
Input: LEDs - skip unnamed LEDs
Devices may declare more LEDs than what is known to input-leds
(HID does this for some devices). Instead of showing ugly warnings
on connect and, even worse, oopsing on disconnect, let's simply
ignore LEDs that are not known to us.
Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Mark Brown [Fri, 24 Jul 2015 15:19:50 +0000 (16:19 +0100)]
Merge remote-tracking branches 'spi/fix/gqspi', 'spi/fix/imx', 'spi/fix/mg-spfi' and 'spi/fix/spidev' into spi-linus
Mark Brown [Fri, 24 Jul 2015 15:19:25 +0000 (16:19 +0100)]
Merge remote-tracking branches 'regulator/fix/88pm800', 'regulator/fix/max8973', 'regulator/fix/s2mps11' and 'regulator/fix/supply' into regulator-linus
Mark Brown [Fri, 24 Jul 2015 15:19:25 +0000 (16:19 +0100)]
Merge remote-tracking branch 'regulator/fix/core' into regulator-linus
Sascha Hauer [Fri, 24 Jul 2015 13:01:08 +0000 (15:01 +0200)]
spi: imx: Fix small DMA transfers
DMA transfers must be greater than the watermark level size. spi_imx->rx_wml
and spi_imx->tx_wml contain the watermark level in 32bit words whereas struct
spi_transfer contains the transfer len in bytes. Fix the check if DMA is
possible for a transfer accordingly. This fixes transfers with sizes between
33 and 128 bytes for which previously was claimed that DMA is possible.
Fixes: f62caccd12c17e4 (spi: spi-imx: add DMA support)
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Denys Vlasenko [Fri, 24 Jul 2015 12:16:43 +0000 (14:16 +0200)]
x86/asm/entry/32: Revert 'Do not use R9 in SYSCALL32' commit
This change reverts most of commit
53e9accf0f 'Do not use R9 in
SYSCALL32'. I don't yet understand how, but code in that commit
sometimes fails to preserve EBP.
See https://bugzilla.kernel.org/show_bug.cgi?id=101061
"Problems while executing 32-bit code on AMD64"
Reported-and-tested-by: Krzysztof A. Sobiecki <sobkas@gmail.com>
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Will Drewry <wad@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
Link: http://lkml.kernel.org/r/1437740203-11552-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Fri, 24 Jul 2015 14:13:43 +0000 (16:13 +0200)]
x86/mm: Fix newly introduced printk format warnings
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Bob Liu [Wed, 22 Jul 2015 06:40:10 +0000 (14:40 +0800)]
xen-blkback: replace work_pending with work_busy in purge_persistent_gnt()
The BUG_ON() in purge_persistent_gnt() will be triggered when previous purge
work haven't finished.
There is a work_pending() before this BUG_ON, but it doesn't account if the work
is still currently running.
CC: stable@vger.kernel.org
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Bob Liu [Wed, 22 Jul 2015 06:40:09 +0000 (14:40 +0800)]
xen-blkfront: don't add indirect pages to list when !feature_persistent
We should consider info->feature_persistent when adding indirect page to list
info->indirect_pages, else the BUG_ON() in blkif_free() would be triggered.
When we are using persistent grants the indirect_pages list
should always be empty because blkfront has pre-allocated enough
persistent pages to fill all requests on the ring.
CC: stable@vger.kernel.org
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Bob Liu [Wed, 22 Jul 2015 06:40:08 +0000 (14:40 +0800)]
xen-blkfront: introduce blkfront_gather_backend_features()
There is a bug when migrate from !feature-persistent host to feature-persistent
host, because domU still thinks new host/backend doesn't support persistent.
Dmesg like:
backed has not unmapped grant: 839
backed has not unmapped grant: 773
backed has not unmapped grant: 773
backed has not unmapped grant: 773
backed has not unmapped grant: 839
The fix is to recheck feature-persistent of new backend in blkif_recover().
See: https://lkml.org/lkml/2015/5/25/469
As Roger suggested, we can split the part of blkfront_connect that checks for
optional features, like persistent grants, indirect descriptors and
flush/barrier features to a separate function and call it from both
blkfront_connect and blkif_recover
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Jingju Hou [Thu, 23 Jul 2015 09:56:23 +0000 (17:56 +0800)]
mmc: sdhci-pxav3: fix platform_data is not initialized
pdev->dev.platform_data is not initialized if match is true in function
sdhci_pxav3_probe. Just local variable pdata is assigned the return value
from function pxav3_get_mmc_pdata().
static int sdhci_pxav3_probe(struct platform_device *pdev) {
struct sdhci_pxa_platdata *pdata = pdev->dev.platform_data;
...
if (match) {
ret = mmc_of_parse(host->mmc);
if (ret)
goto err_of_parse;
sdhci_get_of_property(pdev);
pdata = pxav3_get_mmc_pdata(dev);
}
...
}
Signed-off-by: Jingju Hou <houjingj@marvell.com>
Fixes: b650352dd3df("mmc: sdhci-pxa: Add device tree support")
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Dong Aisheng [Wed, 22 Jul 2015 12:53:10 +0000 (20:53 +0800)]
dts: mmc: fsl-imx-esdhc: remove fsl,cd-controller support
It's not supported by driver anymore after using runtime pm
and there's no user of it, so delete it now.
Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Dong Aisheng [Wed, 22 Jul 2015 12:53:09 +0000 (20:53 +0800)]
mmc: sdhci-esdhc-imx: clear f_max in boarddata
After commit
8d86e4fcccf6 ("mmc: sdhci-esdhc-imx: Call mmc_of_parse()"),
it's not used anymore.
Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Dong Aisheng [Wed, 22 Jul 2015 12:53:08 +0000 (20:53 +0800)]
mmc: sdhci-esdhc-imx: remove duplicated dts parsing
After commit
8d86e4fcccf6 ("mmc: sdhci-esdhc-imx: Call mmc_of_parse()"),
we do not need those duplicated parsing anymore.
Note: fsl,cd-controller is also deleted due to the driver does
not support controller card detection anymore after switch to runtime pm.
And there's no user of it right now in device tree.
wp-gpios is kept because we're still support fsl,wp-controller,
so we need a way to check if it's gpio wp or controller wp.
Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Dong Aisheng [Wed, 22 Jul 2015 12:53:07 +0000 (20:53 +0800)]
mmc: sdhci: make max-frequency property in device tree work
Device tree provides option to specify the max freqency with property
"max-frequency" in dts and common parse function mmc_of_parse() will
parse it and use this value to set host->f_max to tell the MMC core
the maxinum frequency the host works.
However, current sdhci driver will finally overwrite this value with
host->max_clk regardless of the max-frequency property.
This patch makes sure not overwrite the max-frequency set from device
tree and do basic sanity check.
Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Dong Aisheng [Wed, 22 Jul 2015 12:53:06 +0000 (20:53 +0800)]
mmc: sdhci-esdhc-imx: move all non dt probe code into one function
This is an incremental fix of commit
e62bd351b("mmc: sdhci-esdhc-imx: Do not break platform data boards").
After commit
8d86e4fcccf6 ("mmc: sdhci-esdhc-imx: Call mmc_of_parse()"),
we do not need to run the check of boarddata->wp_type/cd_type/max_bus_width
again for dt platform since those are already handled by mmc_of_parse().
Current code only exclude the checking of wp_type for dt platform which
does not make sense.
This patch moves all non dt probe code into one function.
Besides, since we only support SD3.0/eMMC HS200 for dt platform, the
support_vsel checking and ultra high speed pinctrl state are also merged
into sdhci_esdhc_imx_probe_dt.
Then we have two separately probe function for dt and non dt type.
This can make the driver probe more clearly.
Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Dong Aisheng [Wed, 22 Jul 2015 12:53:05 +0000 (20:53 +0800)]
mmc: sdhci-esdhc-imx: fix cd regression for dt platform
Current card detect probe process is that when driver finds a valid
ESDHC_CD_GPIO, it will clear the quirk SDHCI_QUIRK_BROKEN_CARD_DETECTION
which is set by default for all esdhc/usdhc controllers.
Then host driver will know there's a valid card detect function.
Commit
8d86e4fcccf6 ("mmc: sdhci-esdhc-imx: Call mmc_of_parse()")
breaks GPIO CD function for dt platform that it will return directly
when find ESDHC_CD_GPIO for dt platform which result in the later wrongly
to keep SDHCI_QUIRK_BROKEN_CARD_DETECTION for all dt platforms.
Then MMC_CAP_NEEDS_POLL will be used instead even there's a valid
GPIO card detect.
This patch adds back this function and follows the original approach to
clear the quirk if find an valid CD GPIO for dt platforms.
Fixes: 8d86e4fcccf6 ("mmc: sdhci-esdhc-imx: Call mmc_of_parse()")
Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Dong Aisheng [Wed, 22 Jul 2015 12:53:04 +0000 (20:53 +0800)]
dts: imx7: fix sd card gpio polarity specified in device tree
cd-gpios polarity should be changed to GPIO_ACTIVE_LOW and wp-gpios
should be changed to GPIO_ACTIVE_HIGH.
Otherwise, the SD may not work properly due to wrong polarity inversion
specified in DT after switch to common parsing function mmc_of_parse().
Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Acked-by: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Dong Aisheng [Wed, 22 Jul 2015 12:53:03 +0000 (20:53 +0800)]
dts: imx25: fix sd card gpio polarity specified in device tree
cd-gpios polarity should be changed to GPIO_ACTIVE_LOW and wp-gpios
should be changed to GPIO_ACTIVE_HIGH.
Otherwise, the SD may not work properly due to wrong polarity inversion
specified in DT after switch to common parsing function mmc_of_parse().
Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Acked-by: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Dong Aisheng [Wed, 22 Jul 2015 12:53:02 +0000 (20:53 +0800)]
dts: imx6: fix sd card gpio polarity specified in device tree
cd-gpios polarity should be changed to GPIO_ACTIVE_LOW and wp-gpios
should be changed to GPIO_ACTIVE_HIGH.
Otherwise, the SD may not work properly due to wrong polarity inversion
specified in DT after switch to common parsing function mmc_of_parse().
Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Acked-by: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>