git.openwrt.org Git - openwrt/staging/blogic.git/log

blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy

cgroup is trying to make interface consistent across different
controllers.  For weight based resource control, the knob should have
the range [1, 10000] and default to 100.  This patch updates
cfq-iosched so that the weight range conforms.  The internal
calculations have enough range and the widening of the weight range
shouldn't cause any problem.

* blkcg_policy->cpd_bind_fn() is added.  If present, this is invoked
  when blkcg is attached to a hierarchy.

* cfq_cpd_init() is updated to use the new default value on the
  unified hierarchy.

* cfq_cpd_bind() callback is implemented to clear per-blkg configs and
  apply the default config matching the hierarchy type.

* cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
  is moved into !CONFIG_CFQ_GROUP_IOSCHED block.  cfq_cpd_bind() is
  now responsible for initializing the initial weights when blkcg is
  enabled.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/

blkcg is gonna switch to cgroup common weight range as defined by
CGROUP_WEIGHT_* on the unified hierarchy. In preparation, rename
CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: implement interface for the unified hierarchy

blkcg interface grew to be the biggest of all controllers and
unfortunately most inconsistent too.  The interface files are
inconsistent with a number of cloes duplicates.  Some files have
recursive variants while others don't.  There's distinction between
normal and leaf weights which isn't intuitive and there are a lot of
stat knobs which don't make much sense outside of debugging and expose
too much implementation details to userland.

In the unified hierarchy, everything is always hierarchical and
internal nodes can't have tasks rendering the two structural issues
twisting the current interface.  The interface has to be updated in a
significant anyway and this is a good chance to revamp it as a whole.
This patch implements blkcg interface for the unified hierarchy.

* (from a previous patch) blkcg is identified by "io" instead of
  "blkio" on the unified hierarchy.  Given that the whole interface is
  updated anyway, the rename shouldn't carry noticeable conversion
  overhead.

* The original interface consisted of 27 files is replaced with the
  following three files.

  blkio.stat : per-blkcg stats
  blkio.weight : per-cgroup and per-cgroup-queue weight settings
  blkio.max : per-cgroup-queue bps and iops max limits

Documentation/cgroups/unified-hierarchy.txt updated accordingly.

v2: blkcg_policy->dfl_cftypes wasn't removed on
    blkcg_policy_unregister() corrupting the cftypes list.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: misc preparations for unified hierarchy interface

* Export blkg_dev_name()

* Drop unnecessary @cft from __cfq_set_weight().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: separate out tg_conf_updated() from tg_set_conf()

tg_set_conf() is largely consisted of parsing and setting the new
config and the follow-up application and propagation. This patch
separates out the latter part into tg_conf_updated(). This will be
used to implement interface for the unified hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: move body parsing from blkg_conf_prep() to its callers

Currently, blkg_conf_prep() expects input to be of the following form

MAJ:MIN NUM

and reads the NUM part into blkg_conf_ctx->v.  This is quite
restrictive and gets in the way in implementing blkcg interface for
the unified hierarchy.  This patch updates blkg_conf_prep() so that it
expects

MAJ:MIN BODY_STR

where BODY_STR is an arbitrary string.  blkg_conf_ctx->v is replaced
with ->body which is a char pointer pointing to the start of BODY_STR.
Parsing of the body is moved to blkg_conf_prep()'s callers.

To allow using, for example, strsep() on blkg_conf_ctx->val, it is a
non-const pointer and to accommodate that const is dropped from @input
too.

This doesn't cause any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: mark existing cftypes as legacy

blkcg is about to grow interface for the unified hierarchy. Add
legacy to existing cftypes.

* blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
* blk-cgroup.c:blkcg_files -> blkcg_legacy_files
* cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
* blk-throttle.c:throtl_files -> throtl_legacy_files

Pure renames. No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: rename subsystem name from blkio to io

blkio interface has become messy over time and is currently the
largest.  In addition to the inconsistent naming scheme, it has
multiple stat files which report more or less the same thing, a number
of debug stat files which expose internal details which shouldn't have
been part of the public interface in the first place, recursive and
non-recursive stats and leaf and non-leaf knobs.

Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
don't make any sense on the unified hierarchy as only leaf cgroups can
contain processes.  cgroups is going through a major interface
revision with the unified hierarchy involving significant fundamental
usage changes and given that a significant portion of the interface
doesn't make sense anymore, it's a good time to reorganize the
interface.

As the first step, this patch renames the external visible subsystem
name from "blkio" to "io".  This is more concise, matches the other
two major subsystem names, "cpu" and "memory", and better suited as
blkcg will be involved in anything writeback related too whether an
actual block device is involved or not.

As the subsystem legacy_name is set to "blkio", the only userland
visible change outside the unified hierarchy is that blkcg is reported
as "io" instead of "blkio" in the subsystem initialized message during
boot.  On the unified hierarchy, blkcg now appears as "io".

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: refine error codes returned during blkcg configuration

blkcg currently returns -EINVAL for most errors which can be pretty
confusing given that the failure modes are quite varied. Update the
error returns so that

* -EINVAL only for syntactic errors.
* -ERANGE if the value is out of range.
* -ENODEV if the target device can't be found.
* -EOPNOTSUPP if the policy is not enabled on the target device.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()

blkg_to_cfqg() and blkcg_to_cfqgd() on a valid blkg with the policy
enabled are guaranteed to return non-NULL and the counterpart in
blk-throttle doesn't have these checks either. Remove the spurious
NULL checks.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: reduce stack usage of blkg_rwstat_recursive_sum()

The recent percpu conversion of blkg_rwstat triggered the following
warning in certain configurations.

block/blk-cgroup.c:654:1: warning: the frame size of 1360 bytes is larger than 1024 bytes

This is because blkg_rwstat now contains four percpu_counter which can
be pretty big depending on debug options although it shouldn't be a
problem in production configs. This patch removes one of the two
local blkg_rwstat variables used by blkg_rwstat_recursive_sum() to
reduce stack usage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Link: http://article.gmane.org/gmane.linux.kernel.cgroups/13835
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: remove cfqg_stats->sectors

cfq_stats->sectors is a blkg_stat which keeps track of the total
number of sectors serviced; however, this can be trivially calculated
from blkcg_gq->stat_bytes. The only thing necessary is adding up
READs and WRITEs and then dividing by sector size.

Remove cfqg_stats->sectors and make cfq print "sectors" and
"sectors_recursive" from stat_bytes.

While this is a bit more code, it removes duplicate stat allocations
and updates and ensures that the reported stats stay in tune with each
other.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: move io_service_bytes and io_serviced stats into blkcg_gq

Currently, both cfq-iosched and blk-throttle keep track of
io_service_bytes and io_serviced stats.  While keeping track of them
separately may be useful during development, it doesn't make much
sense otherwise.  Also, blk-throttle was counting bio's as IOs while
cfq-iosched request's, which is more confusing than informative.

This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
removes the counterparts from cfq-iosched and blk-throttle and let
them print from the common blkg counters.  The common counters are
incremented during bio issue in blkcg_bio_issue_check().

The outputs are still filtered by whether the policy has
blkg_policy_data on a given blkg, so cfq's output won't show up if it
has never been used for a given blkg.  The only times when the outputs
would differ significantly are when policies are attached on the fly
or elevators are switched back and forth.  Those are quite exceptional
operations and I don't think they warrant keeping separate counters.

v3: Update blkio-controller.txt accordingly.

v2: Account IOs during bio issues instead of request completions so
    that bio-based drivers can be handled the same way.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq

Currently, blkg_[rw]stat_recursive_sum() assume that the target
counter is located in pd (blkg_policy_data); however, some counters
are planned to be moved to blkg (blkcg_gq).

This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
blkg_policy pointers instead of pd. If policy is NULL, it indexes
into blkg. If non-NULL, into the blkg's pd of the policy.

The existing usages are updated to maintain the current behaviors.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: make blkcg_[rw]stat per-cpu

blkcg_[rw]stat are used as stat counters for blkcg policies.  It isn't
per-cpu by itself and blk-throttle makes it per-cpu by wrapping around
it.  This patch makes blkcg_[rw]stat per-cpu and drop the ad-hoc
per-cpu wrapping in blk-throttle.

* blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
  percpu_counter.  This makes syncp unnecessary as remote accesses are
  handled by percpu_counter itself.

* blkg_[rw]stat_init() can now fail due to percpu allocation failure
  and thus are updated to return int.

* percpu_counters need explicit freeing.  blkg_[rw]stat_exit() added.

* As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
  and summing results are stored in ->aux_cnt[] instead.

* Custom per-cpu stat implementation in blk-throttle is removed.

This makes all blkcg stat counters per-cpu without complicating policy
implmentations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it

cgroup stats are local to each cgroup and doesn't propagate to
ancestors by default.  When recursive stats are necessary, the sum is
calculated over all the descendants.  This initially was for backward
compatibility to support both group-local and recursive stats but this
mode of operation makes general sense as stat update is much hotter
thafn reporting those stats.

This however ends up losing recursive stats when a child is removed.
To work around this, cfq-iosched adds its stats to its parent
cfq_group->dead_stats which is summed up together when calculating
recursive stats.

It's planned that the core stats will be moved to blkcg_gq, so we want
to move the mechanism for keeping track of the stats of dead children
from cfq to blkcg core.  This patch adds blkg_[rw]stat->aux_cnt which
are atomic64_t's keeping track of auxiliary counts which are excluded
when reading local counts but included for recursive.

blkg_[rw]stat_merge() which were used by cfq to implement dead_stats
are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of
a dead cgroup to the aux counts of parent->stats instead of separate
->dead_stats.

This will also help making blkg_[rw]stats per-cpu.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: consolidate blkg creation in blkcg_bio_issue_check()

blkg (blkcg_gq) currently is created by blkcg policies invoking
blkg_lookup_create() which ends up repeating about the same code in
different policies.  Theoretically, this can avoid the overhead of
looking and/or creating blkg's if blkcg is enabled but no policy is in
use; however, the cost of blkg lookup / creation is very low
especially if only the root blkcg is in use which is highly likely if
no blkcg policy is in active use - it boils down to a single very
predictable conditional and surrounding RCU protection.

This patch consolidates blkg creation to a new function
blkcg_bio_issue_check() which is called during bio issue from
generic_make_request_checks().  blkcg_bio_issue_check() is now the
only function which tries to create missing blkg's.  The subsequent
policy and request_list operations just perform blkg_lookup() and if
missing falls back to the root.

* blk_get_rl() no longer tries to create blkg.  It uses blkg_lookup()
  instead of blkg_lookup_create().

* blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
  read locked and blkg already looked up.  Both throtl_lookup_tg() and
  throtl_lookup_create_tg() are dropped.

* cfq is similarly updated.  cfq_lookup_create_cfqg() is replaced with
  cfq_lookup_cfqg()which uses blkg_lookup().

This consolidates blkg handling and avoids unnecessary blkg creation
retries under memory pressure.  In addition, this provides a common
bio entry point into blkcg where things like common accounting can be
performed.

v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
    !CONFIG_BLK_DEV_THROTTLING.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blk-throttle: improve queue bypass handling

If a queue is bypassing, all blkcg policies should become noops but
blk-throttle wasn't. It only became noop if the queue was dying.
While this wouldn't lead to an oops as falling back to the root blkg
is safe in this case, this can be a bit surprising - a bypassing queue
could still be applying throttle limits.

Fix it by removing blk_queue_dying() test in throtl_lookup_create_tg()
and testing blk_queue_bypass() in blk_throtl_bio() and bypassing
before doing anything else.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()

Currently, both throttle and cfq policies implement their own root
blkg (blkcg_gq) lookup fast path. This patch moves root blkg
optimization from throtl_lookup_tg() to __blkg_lookup(). cfq-iosched
currently doesn't use blkg_lookup() but will be converted and drop the
optimization too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: inline [__]blkg_lookup()

blkg_lookup() checks whether the target queue is bypassing and, if
not, calls __blkg_lookup() which first checks the lookup hint and then
performs radix tree walk.  The operations upto hint checking are
trivial and there are many users of this function.  This patch inlines
blkg_lookup() and the fast path part of __blkg_lookup().  The radix
tree lookup and hint update are now in blkg_lookup_slowpath().

This will help consolidating blkg handling by easing moving root blkcg
short-circuit to inlined lookup fast path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods

Each active policy has a cpd (blkcg_policy_data) on each blkcg. The
cpd's were allocated by blkcg core and each policy could request to
allocate extra space at the end by setting blkcg_policy->cpd_size
larger than the size of cpd.

This is a bit unusual but blkg (blkcg_gq) policy data used to be
handled this way too so it made sense to be consistent; however, blkg
policy data switched to alloc/free callbacks.

This patch makes similar changes to cpd handling.
blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size. As
cpd allocation is now done from policy side, it can simply allocate a
larger area which embeds cpd at the beginning.

As ->cpd_alloc_fn() may be able to perform all necessary
initializations, this patch makes ->cpd_init_fn() optional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: minor updates around blkcg_policy_data

* Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used
  for blkcg_policy_data.

* Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of
  blkcg.  This makes it consistent with blkg_policy_data methods and
  to-be-added cpd alloc/free methods.

* blkcg_policy_data->blkcg and cpd_to_blkcg() added so that
  cpd_init_fn() can determine the associated blkcg from
  blkcg_policy_data.

v2: blkcg_policy_data->blkcg initializations were missing.  Added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: make blkcg_policy methods take a pointer to blkcg_policy_data

The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
(blkg_policy_data) while the older ones use blkg (blkcg_gq).  As using
blkg doesn't make sense for ->pd_alloc_fn() and after allocation pd
can always be mapped to blkg and given that these are policy-specific
methods, it makes sense to converge on pd.

This patch makes all methods deal with pd instead of blkg.  Most
conversions are trivial.  In blk-cgroup.c, a couple method invocation
sites now test whether pd exists instead of policy state for
consistency.  This shouldn't cause any behavioral differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blk-throttle: clean up blkg_policy_data alloc/init/exit/free methods

With the recent addition of alloc and free methods, things became
messier.  This patch reorganizes them according to the followings.

* ->pd_alloc_fn()

  Responsible for allocation and static initializations - the ones
  which can be done independent of where the pd might be attached.

* ->pd_init_fn()

  Initializations which require the knowledge of where the pd is
  attached.

* ->pd_free_fn()

  The counter part of pd_alloc_fn().  Static de-init and freeing.

This leaves ->pd_exit_fn() without any users.  Removed.

While at it, collapse an one liner function throtl_pd_exit(), which
has only one user, into its user.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blk-throttle: remove asynchrnous percpu stats allocation mechanism

Because percpu allocator couldn't do non-blocking allocations,
blk-throttle was forced to implement an ad-hoc asynchronous allocation
mechanism for its percpu stats for cases where blkg's (blkcg_gq's) are
allocated from an IO path without sleepable context.

Now that percpu allocator can handle gfp_mask and blkg_policy_data
alloc / free are handled by policy methods, the ad-hoc asynchronous
allocation mechanism can be replaced with direct allocation from
tg_stats_alloc_fn(). Rit it out.

This ensures that an active throtl_grp always has valid non-NULL
->stats_cpu. Remove checks on it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: replace blkcg_policy->pd_size with ->pd_alloc/free_fn() methods

A blkg (blkcg_gq) represents the relationship between a cgroup and
request_queue.  Each active policy has a pd (blkg_policy_data) on each
blkg.  The pd's were allocated by blkcg core and each policy could
request to allocate extra space at the end by setting
blkcg_policy->pd_size larger than the size of pd.

This is a bit unusual but was done this way mostly to simplify error
handling and all the existing use cases could be handled this way;
however, this is becoming too restrictive now that percpu memory can
be allocated without blocking.

This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
and pd_free_fn() - which are used to allocate and release pd for a
given policy.  As pd allocation is now done from policy side, it can
simply allocate a larger area which embeds pd at the beginning.  This
change makes ->pd_size pointless.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: make blkcg_activate_policy() allow NULL ->pd_init_fn

blkg_create() allows NULL ->pd_init_fn() but blkcg_activate_policy()
doesn't. As both in-kernel policies implement ->pd_init_fn, it
currently doesn't break anything. Update blkcg_activate_policy() so
that its behavior is consistent with blkg_create().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: restructure blkg_policy_data allocation in blkcg_activate_policy()

When a policy gets activated, it needs to allocate and install its
policy data on all existing blkg's (blkcg_gq's).  Because blkg
iteration is protected by a spinlock, it currently counts the total
number of blkg's in the system, allocates the matching number of
policy data on a list and installs them during a single iteration.

This can be simplified by using speculative GFP_NOWAIT allocations
while iterating and falling back to a preallocated policy data on
failure.  If the preallocated one has already been consumed, it
releases the lock, preallocate with GFP_KERNEL and then restarts the
iteration.  This can be a bit more expensive than before but policy
activation is a very cold path and shouldn't matter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: remove unnecessary blkcg_root handling from css_alloc/free paths

blkcg_css_alloc() bypasses policy data allocation and blkcg_css_free()
bypasses policy data and blkcg freeing for blkcg_root.  There's no
reason to to treat policy data any differently for blkcg_root.  If the
root css gets allocated after policies are registered, policy
registration path will add policy data; otherwise, the alloc path
will.  The free path isn't never invoked for root csses.

This patch removes the unnecessary special handling of blkcg_root from
css_alloc/free paths.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: use blkg_free() in blkcg_init_queue() failure path

When blkcg_init_queue() fails midway after creating a new blkg, it
performs kfree() directly; however, this doesn't free the policy data
areas. Make it use blkg_free() instead. In turn, blkg_free() is
updated to handle root request_list special case.

While this fixes a possible memory leak, it's on an unlikely failure
path of an already cold path and the size leaked per occurrence is
miniscule too. I don't think it needs to be tagged for -stable.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg: remove unnecessary request_list->blkg NULL test in blk_put_rl()

Since ec13b1d6f0a0 ("blkcg: always create the blkcg_gq for the root
blkcg"), a request_list always has its blkg associated. Drop
unnecessary rl->blkg NULL test from blk_put_rl().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

cfq-iosched: charge async IOs to the appropriate blkcg's instead of the root

Up until now, all async IOs were queued to async queues which are
shared across the whole request_queue, which means that blkcg resource
control is completely void on async IOs including all writeback IOs.
It was done this way because writeback didn't support writeback and
there was no way of telling which writeback IO belonged to which
cgroup; however, writeback recently became cgroup aware and writeback
bio's are sent down properly tagged with the blkcg's to charge them
against.

This patch makes async cfq_queues per-cfq_cgroup instead of
per-cfq_data so that each async IO is charged to the blkcg that it was
tagged for instead of unconditionally attributing it to root.

* cfq_data->async_cfqq and ->async_idle_cfqq are moved to cfq_group
  and alloc / destroy paths are updated accordingly.

* cfq_link_cfqq_cfqg() no longer overrides @cfqg to root for async
  queues.

* check_blkcg_changed() now also invalidates async queues as they no
  longer stay the same across cgroups.

After this patch, cfq's proportional IO control through blkio.weight
works correctly when cgroup writeback is in use.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

cfq-iosched: fold cfq_find_alloc_queue() into cfq_get_queue()

cfq_find_alloc_queue() checks whether a queue actually needs to be
allocated, which is unnecessary as its sole caller, cfq_get_queue(),
only calls it if so.  Also, the oom queue fallback logic is scattered
between cfq_get_queue() and cfq_find_alloc_queue().  There really
isn't much going on in the latter and things can be made simpler by
folding it into cfq_get_queue().

This patch collapses cfq_find_alloc_queue() into cfq_get_queue().  The
change is fairly straight-forward with one exception - async_cfqq is
now initialized to NULL and the "!is_sync" test in the last if
conditional is replaced with "async_cfqq" test.  This is because gcc
(5.1.1) gets confused for some reason and warns that async_cfqq may be
used uninitialized otherwise.  Oh well, the code isn't necessarily
worse this way.

This patch doesn't cause any functional difference.

v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

cfq-iosched: move cfq_group determination from cfq_find_alloc_queue() to cfq_get_queue()

This is necessary for making async cfq_cgroups per-cfq_group instead
of per-cfq_data. While this change makes cfq_get_queue() perform RCU
locking and look up cfq_group even when it reuses async queue, the
extra overhead is extremely unlikely to be noticeable given that this
is already sitting behind cic->cfqq[] cache and the overall cost of
cfq operation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

cfq-iosched: remove @gfp_mask from cfq_find_alloc_queue()

Even when allocations fail, cfq_find_alloc_queue() always returns a
valid cfq_queue by falling back to the oom cfq_queue.  As such, there
isn't much point in taking @gfp_mask and trying "harder" if __GFP_WAIT
is set.  GFP_NOWAIT allocations don't fail often and even when they do
the degraded behavior is acceptable and temporary.

After all, the only reason get_request(), which ultimately determines
the gfp_mask, cares about __GFP_WAIT is to guarantee request
allocation, assuming IO forward progress, for callers which are
willing to wait.  There's no reason for cfq_find_alloc_queue() to
behave differently on __GFP_WAIT when it already has a fallback
mechanism.

Remove @gfp_mask from cfq_find_alloc_queue() and propagate the changes
to its callers.  This simplifies the function quite a bit and will
help making async queues per-cfq_group.

v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

blkcg, cfq-iosched: use GFP_NOWAIT instead of GFP_ATOMIC for non-critical allocations

blkcg performs several allocations to track IOs per cgroup and enforce
resource control. Most of these allocations are performed lazily on
demand in the IO path and thus can't involve reclaim path. Currently,
these allocations use GFP_ATOMIC; however, blkcg can gracefully deal
with occassional failures of these allocations by punting IOs to the
root cgroup and there's no reason to reach into the emergency reserve.

This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following
allocations.

* bdi_writeback_congested and blkcg_gq allocations in blkg_create().

* radix tree node allocations for blkcg->blkg_tree.

* cfq_queue allocation on ioprio changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-and-Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Suggested-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

cfq-iosched: minor cleanups

* Some were accessing cic->cfqq[] directly.  Always use cic_to_cfqq()
  and cic_set_cfqq().

* check_ioprio_changed() doesn't need to verify cfq_get_queue()'s
  return for NULL.  It's always non-NULL.  Simplify accordingly.

This patch doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

cfq-iosched: fix oom cfq_queue ref leak in cfq_set_request()

If the cfq_queue cached in cfq_io_cq is the oom one, cfq_set_request()
replaces it by invoking cfq_get_queue() again without putting the oom
queue leaking the reference it was holding. While oom queues are not
released through reference counting, they're still reference counted
and this can theoretically lead to the reference count overflowing and
incorrectly invoke the usual release path on it.

Fix it by making cfq_set_request() put the ref it was holding.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

cfq-iosched: fix async oom queue handling

Async cfqq's (cfq_queue's) are shared across cfq_data. When
cfq_get_queue() obtains a new queue from cfq_find_alloc_queue(), it
stashes the pointer in cfq_data and reuses it from then on; however,
the function doesn't consider that cfq_find_alloc_queue() may return
the oom_cfqq under memory pressure and installs the returned queue
unconditionally.

If the oom_cfqq is installed as an async cfqq, cfq_set_request() will
continue calling cfq_get_queue() hoping to replace it with a proper
queue; however, cfq_get_queue() will keep returning the cached queue
for the slot - the oom_cfqq.

Fix it by skipping caching if the queue is the oom one.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

cfq-iosched: simplify control flow in cfq_get_queue()

cfq_get_queue()'s control flow looks like the following.

async_cfqq = NULL;
cfqq = NULL;

if (!is_sync) {
...
async_cfqq = ...;
cfqq = *async_cfqq;
}

if (!cfqq)
cfqq = ...;

if (!is_sync && !(*async_cfqq))
...;

The only thing the local variable init, the second if, and the
async_cfqq test in the third if achieves is to skip cfqq creation and
installation if *async_cfqq was already non-NULL. This is needlessly
complicated with different tests examining the same condition.
Simplify it to the following.

if (!is_sync) {
...
async_cfqq = ...;
cfqq = *async_cfqq;
if (cfqq)
goto out;
}

cfqq = ...;

if (!is_sync)
...;
out:

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

writeback: update writeback tracepoints to report cgroup

The following tracepoints are updated to report the cgroup used during
cgroup writeback.

* writeback_write_inode[_start]
* writeback_queue
* writeback_exec
* writeback_start
* writeback_written
* writeback_wait
* writeback_nowork
* writeback_wake_background
* wbc_writepage
* writeback_queue_io
* bdi_dirty_ratelimit
* balance_dirty_pages
* writeback_sb_inodes_requeue
* writeback_single_inode[_start]

Note that writeback_bdi_register is separated out from writeback_class
as reporting cgroup doesn't make sense to it. Tracepoints which take
bdi are updated to take bdi_writeback instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

kernfs: implement kernfs_path_len()

Add a function to determine the path length of a kernfs node. This
for now will be used by writeback tracepoint updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

writeback: explain why @inode is allowed to be NULL for inode_congested()

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

writeback: remove wb_writeback_work->single_wait/done

wb_writeback_work->single_wait/done are used for the wait mechanism
for synchronous wb_work (wb_writeback_work) items which are issued
when bdi_split_work_to_wbs() fails to allocate memory for asynchronous
wb_work items; however, there's no reason to use a separate wait
mechanism for this. bdi_split_work_to_wbs() can simply use on-stack
fallback wb_work item and separate wb_completion to wait for it.

This patch removes wb_work->single_wait/done and the related code and
make bdi_split_work_to_wbs() use on-stack fallback wb_work and
wb_completion instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

writeback: bdi_for_each_wb() iteration is memcg ID based not blkcg

wb's (bdi_writeback's) are currently keyed by memcg ID; however, in an
earlier implementation, wb's were keyed by blkcg ID.
bdi_for_each_wb() walks bdi->cgwb_tree in the ascending ID order and
allows iterations to start from an arbitrary ID which is used to
interrupt and resume iterations.

Unfortunately, while changing wb to be keyed by memcg ID instead of
blkcg, bdi_for_each_wb() was missed and is still assuming that wb's
are keyed by blkcg ID. This doesn't affect iterations which don't get
interrupted but bdi_split_work_to_wbs() makes use of iteration
resuming on allocation failures and thus may incorrectly skip or
repeat wb's.

Fix it by changing bdi_for_each_wb() to take memcg IDs instead of
blkcg IDs and updating bdi_split_work_to_wbs() accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

Merge branch 'for-4.3-unified-base' of git://git./linux/kernel/git/tj/cgroup into for-4.3/blkcg

cgroup: introduce cgroup_subsys->legacy_name

This allows cgroup subsystems to use a different name on the unified
hierarchy.  cgroup_subsys->name is used on the unified hierarchy,
->legacy_name elsewhere.  If ->legacy_name is not explicitly set, it's
automatically set to ->name and the userland visible behavior remains
unchanged.

v2: Make parse_cgroupfs_options() only consider ->legacy_name as mount
    options are used only on legacy hierarchies.  Suggested by Li
    Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org

cgroup: don't print subsystems for the default hierarchy

It doesn't make sense to print subsystems on mount option or
/proc/PID/cgroup for the default hierarchy.

* cgroup.controllers file at the root of the default hierarchy lists
the currently attached controllers.

* The default hierarchy is catch-all for unmounted subsystems.

* The default hierarchy doesn't accept any mount options.

Suppress subsystem printing on mount options and /proc/PID/cgroup for
the default hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org

cgroup: make cftype->private a unsigned long

It's pretty unusual to have an int as a private data field and it
makes it impossible to carray a pointer value through it. Let's make
it an unsigned long. AFAICS, this shouldn't break anything.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

cgroup: export cgrp_dfl_root

While cgroup subsystems can't be modules, blkcg supports dynamically
loadable policies which interact with cgroup core. Export
cgrp_dfl_root so that cgroup_on_dfl() can be used in those modules.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

cgroup: define controller file conventions

Traditionally, each cgroup controller implemented whatever interface
it wanted leading to interfaces which are widely inconsistent.
Examining the requirements of the controllers readily yield that there
are only a few control schemes shared among all.

Two major controllers already had to implement new interface for the
unified hierarchy due to significant structural changes.  Let's take
the chance to establish common conventions throughout all controllers.

This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight
based control knobs and documents the conventions that controllers
should follow on the unified hierarchy.  Except for io.weight knob,
all existing unified hierarchy knobs are already compliant.  A
follow-up patch will update io.weight.

v2: Added descriptions of min, low and high knobs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>

Merge branch 'stable/for-jens-4.2' of git://git./linux/kernel/git/konrad/xen into for-linus

Konrad writes:

"There are three bugs that have been found in the xen-blkfront (and
backend). Two of them have the stable tree CC-ed. They have been found
where an guest is migrating to a host that is missing
'feature-persistent' support (from one that has it enabled). We end up
hitting an BUG() in the driver code."

Linux 4.2-rc4

Merge branch 'perf-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull perf fix from Thomas Gleixner:
"A single fix for the intel cqm perf facility to prevent IPIs from
interrupt context"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/cqm: Return cached counter value from IRQ context

Merge branch 'x86-urgent-for-linus' of git://git./linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
"This update contains:

   - the manual revert of the SYSCALL32 changes which caused a
     regression

   - a fix for the MPX vma handling

   - three fixes for the ioremap 'is ram' checks.

   - PAT warning fixes

   - a trivial fix for the size calculation of TLB tracepoints

   - handle old EFI structures gracefully

  This also contains a PAT fix from Jan plus a revert thereof.  Toshi
  explained why the code is correct"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/pat: Revert 'Adjust default caching mode translation tables'
  x86/asm/entry/32: Revert 'Do not use R9 in SYSCALL32' commit
  x86/mm: Fix newly introduced printk format warnings
  mm: Fix bugs in region_is_ram()
  x86/mm: Remove region_is_ram() call from ioremap
  x86/mm: Move warning from __ioremap_check_ram() to the call site
  x86/mm/pat, drivers/media/ivtv: Move the PAT warning and replace WARN() with pr_warn()
  x86/mm/pat, drivers/infiniband/ipath: Replace WARN() with pr_warn()
  x86/mm/pat: Adjust default caching mode translation tables
  x86/fpu: Disable dependent CPU features on "noxsave"
  x86/mpx: Do not set ->vm_ops on MPX VMAs
  x86/mm: Add parenthesis for TLB tracepoint size calculation
  efi: Handle memory error structures produced based on old versions of standard

x86/mm/pat: Revert 'Adjust default caching mode translation tables'

Toshi explains:

"No, the default values need to be set to the fallback types,
i.e. minimal supported mode. For WC and WT, UC is the fallback type.

When PAT is disabled, pat_init() does update the tables below to
enable WT per the default BIOS setup. However, when PAT is enabled,
but CPU has PAT -errata, WT falls back to UC per the default values."

Revert: ca1fec58bc6a 'x86/mm/pat: Adjust default caching mode translation tables'
Requested-by: Toshi Kani <toshi.kani@hp.com>
Cc: Jan Beulich <jbeulich@suse.de>
Link: http://lkml.kernel.org/r/1437577776.3214.252.camel@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

perf/x86/intel/cqm: Return cached counter value from IRQ context

Peter reported the following potential crash which I was able to
reproduce with his test program,

[  148.765788] ------------[ cut here ]------------
[  148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260()
[  148.765797] Modules linked in:
[  148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4
[  148.765803]  ffffffff81cdc398 ffff88085f105950 ffffffff818bdfd5 0000000000000007
[  148.765805]  0000000000000000 ffff88085f105990 ffffffff810e413a 0000000000000000
[  148.765807]  ffffffff82301080 0000000000000022 ffffffff8107f640 ffffffff8107f640
[  148.765809] Call Trace:
[  148.765810]  <NMI>  [<ffffffff818bdfd5>] dump_stack+0x45/0x57
[  148.765818]  [<ffffffff810e413a>] warn_slowpath_common+0x8a/0xc0
[  148.765822]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[  148.765824]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[  148.765825]  [<ffffffff810e422a>] warn_slowpath_null+0x1a/0x20
[  148.765827]  [<ffffffff811613f6>] smp_call_function_many+0xb6/0x260
[  148.765829]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[  148.765831]  [<ffffffff81161748>] on_each_cpu_mask+0x28/0x60
[  148.765832]  [<ffffffff8107f6ef>] intel_cqm_event_count+0x7f/0xe0
[  148.765836]  [<ffffffff811cdd35>] perf_output_read+0x2a5/0x400
[  148.765839]  [<ffffffff811d2e5a>] perf_output_sample+0x31a/0x590
[  148.765840]  [<ffffffff811d333d>] ? perf_prepare_sample+0x26d/0x380
[  148.765841]  [<ffffffff811d3497>] perf_event_output+0x47/0x60
[  148.765843]  [<ffffffff811d36c5>] __perf_event_overflow+0x215/0x240
[  148.765844]  [<ffffffff811d4124>] perf_event_overflow+0x14/0x20
[  148.765847]  [<ffffffff8107e7f4>] intel_pmu_handle_irq+0x1d4/0x440
[  148.765849]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765853]  [<ffffffff81219bad>] ? vunmap_page_range+0x19d/0x2f0
[  148.765854]  [<ffffffff81219d11>] ? unmap_kernel_range_noflush+0x11/0x20
[  148.765859]  [<ffffffff814ce6fe>] ? ghes_copy_tofrom_phys+0x11e/0x2a0
[  148.765863]  [<ffffffff8109e5db>] ? native_apic_msr_write+0x2b/0x30
[  148.765865]  [<ffffffff8109e44d>] ? x2apic_send_IPI_self+0x1d/0x20
[  148.765869]  [<ffffffff81065135>] ? arch_irq_work_raise+0x35/0x40
[  148.765872]  [<ffffffff811c8d86>] ? irq_work_queue+0x66/0x80
[  148.765875]  [<ffffffff81075306>] perf_event_nmi_handler+0x26/0x40
[  148.765877]  [<ffffffff81063ed9>] nmi_handle+0x79/0x100
[  148.765879]  [<ffffffff81064422>] default_do_nmi+0x42/0x100
[  148.765880]  [<ffffffff81064563>] do_nmi+0x83/0xb0
[  148.765884]  [<ffffffff818c7c0f>] end_repeat_nmi+0x1e/0x2e
[  148.765886]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765888]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765890]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765891]  <<EOE>>  [<ffffffff8110ab66>] finish_task_switch+0x156/0x210
[  148.765898]  [<ffffffff818c1671>] __schedule+0x341/0x920
[  148.765899]  [<ffffffff818c1c87>] schedule+0x37/0x80
[  148.765903]  [<ffffffff810ae1af>] ? do_page_fault+0x2f/0x80
[  148.765905]  [<ffffffff818c1f4a>] schedule_user+0x1a/0x50
[  148.765907]  [<ffffffff818c666c>] retint_careful+0x14/0x32
[  148.765908] ---[ end trace e33ff2be78e14901 ]---

The CQM task events are not safe to be called from within interrupt
context because they require performing an IPI to read the counter value
on all sockets. And performing IPIs from within IRQ context is a
"no-no".

Make do with the last read counter value currently event in
event->count when we're invoked in this context.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Will Auld <will.auld@intel.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Merge tag 'usb-4.2-rc4' of git://git./linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here's a few USB and PHY fixes for 4.2-rc4.

  Nothing major, the shortlog has the full details.

  All of these have been in linux-next successfully"

* tag 'usb-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (21 commits)
  USB: OHCI: fix bad #define in ohci-tmio.c
  cdc-acm: Destroy acm_minors IDR on module exit
  usb-storage: Add ignore-device quirk for gm12u320 based usb mini projectors
  usb-storage: ignore ZTE MF 823 card reader in mode 0x1225
  USB: OHCI: Fix race between ED unlink and URB submission
  usb: core: lpm: set lpm_capable for root hub device
  xhci: do not report PLC when link is in internal resume state
  xhci: prevent bus_suspend if SS port resuming in phase 1
  xhci: report U3 when link is in resume state
  xhci: Calculate old endpoints correctly on device reset
  usb: xhci: Bugfix for NULL pointer deference in xhci_endpoint_init() function
  xhci: Workaround to get D3 working in Intel xHCI
  xhci: call BIOS workaround to enable runtime suspend on Intel Braswell
  usb: dwc3: Reset the transfer resource index on SET_INTERFACE
  usb: gadget: udc: core: Fix argument of dma_map_single for IOMMU
  usb: gadget: mv_udc_core: fix phy_regs I/O memory leak
  usb: ulpi: ulpi_init should be executed in subsys_initcall
  phy: berlin-usb: fix divider for BG2
  phy: berlin-usb: fix divider for BG2CD
  phy/pxa: add HAS_IOMEM dependency
  ...

Merge tag 'tty-4.2-rc4' of git://git./linux/kernel/git/gregkh/tty

Pull tty/serial driver fixes from Greg KH:
"Here are a number of small serial and tty fixes for reported issues.

  All have been in linux-next successfully"

* tag 'tty-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  tty: vt: Fix !TASK_RUNNING diagnostic warning from paste_selection()
  serial: core: Fix crashes while echoing when closing
  m32r: Add ioreadXX/iowriteXX big-endian mmio accessors
  Revert "serial: imx: initialized DMA w/o HW flow enabled"
  sc16is7xx: fix FIFO address of secondary UART
  sc16is7xx: fix Kconfig dependencies
  serial: etraxfs-uart: Fix release etraxfs_uart_ports
  tty/vt: Fix the memory leak in visual_init
  serial: amba-pl011: Fix devm_ioremap_resource return value check
  n_tty: signal and flush atomically

Merge tag 'staging-4.2-rc4' of git://git./linux/kernel/git/gregkh/staging

Pull staging driver fixes from Greg KH:
"Here are a number of iio and staging driver fixes for reported issues
  for 4.2-rc4.

  All have been in linux-next for a while with no problems"

* tag 'staging-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (34 commits)
  iio:light:stk3310: make endianness independent of host
  iio:light:stk3310: move device register to end of probe
  iio: mma8452: use iio event type IIO_EV_TYPE_MAG
  iio: mcp320x: Fix NULL pointer dereference
  iio: adc: vf610: fix the adc register read fail issue
  iio: mlx96014: Replace offset sign
  iio: magnetometer: mmc35240: fix SET/RESET sequence
  iio: magnetometer: mmc35240: Fix SET/RESET mask
  iio: magnetometer: mmc35240: Fix crash in pm suspend
  iio:magnetometer:bmc150_magn: output intended variable
  iio:magnetometer:bmc150_magn: add regmap dependency
  staging: vt6656: check ieee80211_bss_conf bssid not NULL
  staging: vt6655: check ieee80211_bss_conf bssid not NULL
  iio: tmp006: Check channel info on write
  iio: sx9500: Add missing init in sx9500_buffer_pre{en,dis}able()
  iio:light:ltr501: fix regmap dependency
  iio:light:ltr501: fix variable in ltr501_init
  iio: sx9500: fix bug in compensation code
  iio: sx9500: rework error handling of raw readings
  iio: magnetometer: mmc35240: fix available sampling frequencies
  ...

Merge tag 'char-misc-4.2-rc4' of git://git./linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
"Here are some char and misc driver fixes for reported issues.

  One parport patch is reverted as it was incorrect, thanks to testing
  by the 0-day bot"

* tag 'char-misc-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  parport: Revert "parport: fix memory leak"
  mei: prevent unloading mei hw modules while the device is opened.
  misc: mic: scif bug fix for vmalloc_to_page crash
  parport: fix freeing freed memory
  parport: fix memory leak
  parport: fix error handling

parport: Revert "parport: fix memory leak"

This reverts commit 23c405912b88 ("parport: fix memory leak")

par_dev->state was already being removed in parport_unregister_device().

Reported-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'trace-v4.2-rc2-fix3' of git://git./linux/kernel/git/rostedt/linux-trace

Pull ftrace fix from Steven Rostedt:
"Back in 3.16 the ftrace code was redesigned and cleaned up to remove
  the double iteration list (one for registered ftrace ops, and one for
  registered "global" ops), to just use one list.  That simplified the
  code but also broke the function tracing filtering on pid.

  This updates the code to handle the filtering again with the new
  logic"

* tag 'trace-v4.2-rc2-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix breakage of set_ftrace_pid

Merge branch 'libnvdimm-fixes' of git://git./linux/kernel/git/djbw/nvdimm

Pull libnvdimm fix from Dan Williams:
"A minor fix for the libnvdimm subsystem.

  This is not critical.  The problem can be worked around in userspace
  by putting the namespace temporarily into raw mode
  (ndctl_namespace_set_raw_mode() from libndctl), but that is awkward
  for management utilities.

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm:
  libnvdimm: fix namespace seed creation

Merge tag 'md/4.2-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
"Some md fixes for 4.2

  Several are tagged for -stable.
  A few aren't because they are not very, serious or because they are in
  the 'experimental' cluster code"

* tag 'md/4.2-fixes' of git://neil.brown.name/md:
  md/raid5: clear R5_NeedReplace when no longer needed.
  Fix read-balancing during node failure
  md-cluster: fix bitmap sub-offset in bitmap_read_sb
  md: Return error if request_module fails and returns positive value
  md: Skip cluster setup in case of error while reading bitmap
  md/raid1: fix test for 'was read error from last working device'.
  md: Skip cluster setup for dm-raid
  md: flush ->event_work before stopping array.
  md/raid10: always set reshape_safe when initializing reshape_position.
  md/raid5: avoid races when changing cache size.

Merge tag 'for-linus-20150724' of git://git.infradead.org/linux-mtd

Pull MTD fixes from Brian Norris:
"Two trivial updates.  I meant to send these much earlier, but I've
  been preoccupied.

   - Add MAINTAINERS entry for diskonchip g3 driver

   - Fix an overlooked conflict in bitfield value assignments

  The latter update is a bit overdue, but there's no reason to wait any
  longer"

* tag 'for-linus-20150724' of git://git.infradead.org/linux-mtd:
  mtd: nand: Fix NAND_USE_BOUNCE_BUFFER flag conflict
  MAINTAINERS: mtd: docg3: add docg3 maintainer

libnvdimm: fix namespace seed creation

A new BLK namespace "seed" device is created whenever the current seed
is successfully probed. However, if that namespace is assigned to a BTT
it may never directly experience a successful probe as it is a
subordinate device to a BTT configuration.

The effect of the current code is that no new namespaces can be
instantiated, after the seed namespace, to consume available BLK DPA
capacity. Fix this by treating a successful BTT probe event as a
successful probe event for the backing namespace.

Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
"Four smaller fixes for the current series.  This contains:

   - A fix for clones of discard bio's, that can cause data corruption.
     From Martin.

   - A fix for null_blk, where in certain queue modes it could access a
     request after it had been freed.  From Mike Krinkin.

   - An error handling leak fix for blkcg, from Tejun.

   - Also from Tejun, export of the functions that a file system needs
     to implement cgroup writeback support"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: Do a full clone when splitting discard bios
  block: export bio_associate_*() and wbc_account_io()
  blkcg: fix gendisk reference leak in blkg_conf_prep()
  null_blk: fix use-after-free problem

Merge branch 'for-4.2-fixes' of git://git./linux/kernel/git/tj/libata

Pull libata fixes from Tejun Heo:
"A couple important fixes.

   - A block layer change which removed restriction on max transfer size
     led to silent data corruption on some devices.  A new quirk is
     added to restore the old size limit for the reported device.  If it
     gets reported on more devices, we might have to consider restoring
     the restriction for ATA devices by default.

   - There finally is a SSD which is confirmed to cause data corruption
     on TRIM regardless of which flavor is used.  A new quirk is added
     and the device is blacklisted

   - Other device-specific workarounds"

* 'for-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
  libata: Do not blacklist M510DC
  libata: increase the timeout when setting transfer mode
  libata: add ATA_HORKAGE_MAX_SEC_1024 to revert back to previous max_sectors limit
  libata: force disable trim for SuperSSpeed S238
  libata: add ATA_HORKAGE_NOTRIM
  libata: add ATA_HORKAGE_BROKEN_FPDMA_AA quirk for HP 250GB SATA disk VB0250EAVER
  ata: pmp: add quirk for Marvell 4140 SATA PMP

Merge tag 'mmc-4.2-rc3' of git://git.linaro.org/people/ulf.hansson/mmc

Pull MMC fixes from Ulf Hansson:
"Here are some mmc fixes intended for v4.2 rc4.

  Note, most of the changes are for the sdhci-esdhc-imx controller,
  which also required us to modify some related DTS files.  Those
  changes have been acked by the SoC maintainer.

  MMC core:
   - Fix a reference inbalance issue for power_ro_lock_show() sysfs handler

  MMC host:
   - omap_hsmmc: Fix IRQ errorhandling for CD, DTO, and CRC
   - sdhci: Prevent a kernel panic while using DMA
   - mtk-sd: Let it depend on HAS_DMA to prevent build errors
   - sdhci-esdhc: Make 8BIT bus work
   - sdhci-esdhc-imx: Fix some regressions for DT based platforms
   - sdhci-pxav3: Fix a regression for DT based platforms"

* tag 'mmc-4.2-rc3' of git://git.linaro.org/people/ulf.hansson/mmc:
  mmc: sdhci-pxav3: fix platform_data is not initialized
  dts: mmc: fsl-imx-esdhc: remove fsl,cd-controller support
  mmc: sdhci-esdhc-imx: clear f_max in boarddata
  mmc: sdhci-esdhc-imx: remove duplicated dts parsing
  mmc: sdhci: make max-frequency property in device tree work
  mmc: sdhci-esdhc-imx: move all non dt probe code into one function
  mmc: sdhci-esdhc-imx: fix cd regression for dt platform
  dts: imx7: fix sd card gpio polarity specified in device tree
  dts: imx25: fix sd card gpio polarity specified in device tree
  dts: imx6: fix sd card gpio polarity specified in device tree
  dts: imx53: fix sd card gpio polarity specified in device tree
  dts: imx51: fix sd card gpio polarity specified in device tree
  mmc: sdhci-esdhc: Make 8BIT bus work
  mmc: block: Add missing mmc_blk_put() in power_ro_lock_show()
  mmc: MMC_MTK should depend on HAS_DMA
  mmc: sdhci check parameters before call dma_free_coherent
  mmc: omap_hsmmc: Handle BADA, DEB and CEB interrupts
  mmc: omap_hsmmc: Fix DTO and DCRC handling

Merge branch 'for-linus' of git://git./linux/kernel/git/dtor/input

Pull input fixes from Dmitry Torokhov:
"A fix for the warnings/oops when handling HID devices with "unnamed"
  LEDs and couple of other driver fixups""

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: goodix - fix touch coordinates on WinBook TW100 and TW700
  Input: LEDs - skip unnamed LEDs
  Input: usbtouchscreen - avoid unresponsive TSC-30 touch screen
  Input: elantech - force resolution of 31 u/mm
  Input: zforce - don't overwrite the stack

Merge tag 'regulator-fix-v4.2-rc3' of git://git./linux/kernel/git/broonie/regulator

Pull regulator fixes from Mark Brown:
"As well as some driver specific fixes there's several fixes here for
  the core support for regulators supplying other regulators fixing both
  an issue with ACPI support (which had never been tested before) and
  some error handling and device removal issues that Javier noticed"

* tag 'regulator-fix-v4.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: core: Fix memory leak in regulator_resolve_supply()
  regulator: core: Increase refcount for regulator supply's module
  regulator: core: Handle full constraints systems when resolving supplies
  regulator: 88pm800: fix LDO vsel_mask value
  regulator: max8973: Fix up control flag option for bias control
  regulator: s2mps11: Fix GPIO suspend enable shift wrapping bug

Merge tag 'spi-fix-v4.2-rc3' of git://git./linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
"A small collection of pretty much unremarkable driver specific fixes
  here plus the addition of a new device ID to spidev which requires no
  other code changes"

* tag 'spi-fix-v4.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: imx: Fix small DMA transfers
  spi: zynq: missing break statement
  spi: SPI_ZYNQMP_GQSPI should depend on HAS_DMA
  spi: spidev: add compatible value for LTC2488
  spi: img-spfi: fix support for speeds up to 1/4th input clock

Merge tag 'sound-4.2-rc4' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"This has been a calm week again: one minor lockdep fix for PCM core,
  and the most of the rest are HD-audio quirks and fixups for various
  chips and machines"

* tag 'sound-4.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda - Add headset mic pin quirk for a Dell device
  ALSA: hda - remove one pin from ALC292_STANDARD_PINS
  ALSA: hda - Add new GPU codec ID 0x10de007d to snd-hda
  ALSA: hda: add new AMD PCI IDs with proper driver caps
  ALSA: hda - Fix Skylake codec timeout
  ALSA: hda - Add headset mic support for Acer Aspire V5-573G
  ALSA: sparc: Add missing kfree in error path
  ALSA: pcm: Fix lockdep warning with nonatomic PCM ops

Merge branch 'for-linus' of git://git./linux/kernel/git/jikos/hid

Pull HID fixes from Jiri Kosina:

- kernel crash fixes for multitouch and wacom drivers, by Brent Adam
   and Dan Carpenter

- cp2112 data packet race condition corruption fix, by Antonio Borneo

- a few new device IDs for wacom and microsoft drivers

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
  HID: cp2112: fix to force single data-report reply
  HID: wacom: Enable pad device for older Bamboo Touch tablets
  HID: multitouch: Fix fields from pen report ID being interpreted for multitouch
  HID: microsoft: Add quirk for MS Surface Type/Touch cover
  HID: wacom: NULL dereferences on error in probe()

Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
"Aome amdgpu, one i915, one ttm and one hlcdc, nothing too scary.

  All seems fine for about this time"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/ttm: recognize ARM64 arch in ioprot handler
  drm/amdgpu/cz/dpm: properly report UVD and VCE clock levels
  drm/amdgpu/cz: implement voltage validation properly
  drm/amdgpu: add VCE harvesting instance query
  drm/amdgpu: implement VCE 3.0 harvesting support (v4)
  drm/amdgpu/dce10: Re-set VBLANK interrupt state when enabling a CRTC
  drm/amdgpu/dce11: Re-set VBLANK interrupt state when enabling a CRTC
  drm: Stop resetting connector state to unknown
  drm/i915: Use two 32bit reads for select 64bit REG_READ ioctls
  drm: atmel-hlcdc: fix vblank initial state

Merge branch 'stable' of git://git./linux/kernel/git/cmetcalf/linux-tile

Pull arch/tile bugfix from Chris Metcalf:
"This fixes a bug in freeing the initramfs memory"

* 'stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
tile: use free_bootmem_late() for initrd

Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
"Everything related to the new quirks and memory type features:

   - small improvements to the quirks API

   - extending one of the quirks from just AMD to Intel as well, because
     4.2 can show the same problem with problematic firmware on Intel
     too"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: rename quirk constants to KVM_X86_QUIRK_*
  KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED
  KVM: x86: introduce kvm_check_has_quirk
  KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type
  KVM: MTRR: fix memory type handling if MTRR is completely disabled

ftrace: Fix breakage of set_ftrace_pid

Commit 4104d326b670 ("ftrace: Remove global function list and call function
directly") simplified the ftrace code by removing the global_ops list with a
new design. But this cleanup also broke the filtering of PIDs that are added
to the set_ftrace_pid file.

Add back the proper hooks to have pid filtering working once again.

Cc: stable@vger.kernel.org # 3.16+
Reported-by: Matt Fleming <matt@console-pimps.org>
Reported-by: Richard Weinberger <richard.weinberger@gmail.com>
Tested-by: Matt Fleming <matt@console-pimps.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Input: goodix - fix touch coordinates on WinBook TW100 and TW700

The touchscreen on the WinBook TW100 and TW700 don't match the default
display, with 0,0 touches being reported when touching at the bottom
right of the screen.

  1280,800             0,800
         +-------------+
         |             |
         |             |
         |             |
         +-------------+
    1280,0             0,0

It's unfortunately impossible to detect this problem with data from the
DSDT, or other auxiliary metadata, so fallback to quirking this specific
model of tablet instead.

Signed-off-by: Bastien Nocera <hadess@hadess.net>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Input: LEDs - skip unnamed LEDs

Devices may declare more LEDs than what is known to input-leds
(HID does this for some devices). Instead of showing ugly warnings
on connect and, even worse, oopsing on disconnect, let's simply
ignore LEDs that are not known to us.

Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge remote-tracking branches 'spi/fix/gqspi', 'spi/fix/imx', 'spi/fix/mg-spfi' and 'spi/fix/spidev' into spi-linus

Merge remote-tracking branches 'regulator/fix/88pm800', 'regulator/fix/max8973', 'regulator/fix/s2mps11' and 'regulator/fix/supply' into regulator-linus

Merge remote-tracking branch 'regulator/fix/core' into regulator-linus

spi: imx: Fix small DMA transfers

DMA transfers must be greater than the watermark level size. spi_imx->rx_wml
and spi_imx->tx_wml contain the watermark level in 32bit words whereas struct
spi_transfer contains the transfer len in bytes. Fix the check if DMA is
possible for a transfer accordingly. This fixes transfers with sizes between
33 and 128 bytes for which previously was claimed that DMA is possible.

Fixes: f62caccd12c17e4 (spi: spi-imx: add DMA support)
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>

x86/asm/entry/32: Revert 'Do not use R9 in SYSCALL32' commit

This change reverts most of commit 53e9accf0f 'Do not use R9 in
SYSCALL32'. I don't yet understand how, but code in that commit
sometimes fails to preserve EBP.

See https://bugzilla.kernel.org/show_bug.cgi?id=101061
"Problems while executing 32-bit code on AMD64"

Reported-and-tested-by: Krzysztof A. Sobiecki <sobkas@gmail.com>
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Will Drewry <wad@chromium.org>
Cc: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
Link: http://lkml.kernel.org/r/1437740203-11552-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

x86/mm: Fix newly introduced printk format warnings

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

xen-blkback: replace work_pending with work_busy in purge_persistent_gnt()

The BUG_ON() in purge_persistent_gnt() will be triggered when previous purge
work haven't finished.

There is a work_pending() before this BUG_ON, but it doesn't account if the work
is still currently running.

CC: stable@vger.kernel.org
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkfront: don't add indirect pages to list when !feature_persistent

We should consider info->feature_persistent when adding indirect page to list
info->indirect_pages, else the BUG_ON() in blkif_free() would be triggered.

When we are using persistent grants the indirect_pages list
should always be empty because blkfront has pre-allocated enough
persistent pages to fill all requests on the ring.

CC: stable@vger.kernel.org
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkfront: introduce blkfront_gather_backend_features()

There is a bug when migrate from !feature-persistent host to feature-persistent
host, because domU still thinks new host/backend doesn't support persistent.
Dmesg like:
backed has not unmapped grant: 839
backed has not unmapped grant: 773
backed has not unmapped grant: 773
backed has not unmapped grant: 773
backed has not unmapped grant: 839

The fix is to recheck feature-persistent of new backend in blkif_recover().
See: https://lkml.org/lkml/2015/5/25/469

As Roger suggested, we can split the part of blkfront_connect that checks for
optional features, like persistent grants, indirect descriptors and
flush/barrier features to a separate function and call it from both
blkfront_connect and blkif_recover

Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

mmc: sdhci-pxav3: fix platform_data is not initialized

pdev->dev.platform_data is not initialized if match is true in function
sdhci_pxav3_probe. Just local variable pdata is assigned the return value
from function pxav3_get_mmc_pdata().

static int sdhci_pxav3_probe(struct platform_device *pdev) {

    struct sdhci_pxa_platdata *pdata = pdev->dev.platform_data;
    ...
    if (match) {
ret = mmc_of_parse(host->mmc);
if (ret)
goto err_of_parse;
sdhci_get_of_property(pdev);
pdata = pxav3_get_mmc_pdata(dev);
     }
     ...
}

Signed-off-by: Jingju Hou <houjingj@marvell.com>
Fixes: b650352dd3df("mmc: sdhci-pxa: Add device tree support")
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

dts: mmc: fsl-imx-esdhc: remove fsl,cd-controller support

It's not supported by driver anymore after using runtime pm
and there's no user of it, so delete it now.

Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

mmc: sdhci-esdhc-imx: clear f_max in boarddata

After commit 8d86e4fcccf6 ("mmc: sdhci-esdhc-imx: Call mmc_of_parse()"),
it's not used anymore.

Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

mmc: sdhci-esdhc-imx: remove duplicated dts parsing

After commit 8d86e4fcccf6 ("mmc: sdhci-esdhc-imx: Call mmc_of_parse()"),
we do not need those duplicated parsing anymore.

Note: fsl,cd-controller is also deleted due to the driver does
not support controller card detection anymore after switch to runtime pm.
And there's no user of it right now in device tree.

wp-gpios is kept because we're still support fsl,wp-controller,
so we need a way to check if it's gpio wp or controller wp.

Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

mmc: sdhci: make max-frequency property in device tree work

Device tree provides option to specify the max freqency with property
"max-frequency" in dts and common parse function mmc_of_parse() will
parse it and use this value to set host->f_max to tell the MMC core
the maxinum frequency the host works.

However, current sdhci driver will finally overwrite this value with
host->max_clk regardless of the max-frequency property.

This patch makes sure not overwrite the max-frequency set from device
tree and do basic sanity check.

Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

mmc: sdhci-esdhc-imx: move all non dt probe code into one function

This is an incremental fix of commit
e62bd351b("mmc: sdhci-esdhc-imx: Do not break platform data boards").

After commit 8d86e4fcccf6 ("mmc: sdhci-esdhc-imx: Call mmc_of_parse()"),
we do not need to run the check of boarddata->wp_type/cd_type/max_bus_width
again for dt platform since those are already handled by mmc_of_parse().

Current code only exclude the checking of wp_type for dt platform which
does not make sense.

This patch moves all non dt probe code into one function.
Besides, since we only support SD3.0/eMMC HS200 for dt platform, the
support_vsel checking and ultra high speed pinctrl state are also merged
into sdhci_esdhc_imx_probe_dt.

Then we have two separately probe function for dt and non dt type.
This can make the driver probe more clearly.

Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

mmc: sdhci-esdhc-imx: fix cd regression for dt platform

Current card detect probe process is that when driver finds a valid
ESDHC_CD_GPIO, it will clear the quirk SDHCI_QUIRK_BROKEN_CARD_DETECTION
which is set by default for all esdhc/usdhc controllers.
Then host driver will know there's a valid card detect function.

Commit 8d86e4fcccf6 ("mmc: sdhci-esdhc-imx: Call mmc_of_parse()")
breaks GPIO CD function for dt platform that it will return directly
when find ESDHC_CD_GPIO for dt platform which result in the later wrongly
to keep SDHCI_QUIRK_BROKEN_CARD_DETECTION for all dt platforms.
Then MMC_CAP_NEEDS_POLL will be used instead even there's a valid
GPIO card detect.

This patch adds back this function and follows the original approach to
clear the quirk if find an valid CD GPIO for dt platforms.

Fixes: 8d86e4fcccf6 ("mmc: sdhci-esdhc-imx: Call mmc_of_parse()")
Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Reviewed-by: Johan Derycke <johan.derycke@barco.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

dts: imx7: fix sd card gpio polarity specified in device tree

cd-gpios polarity should be changed to GPIO_ACTIVE_LOW and wp-gpios
should be changed to GPIO_ACTIVE_HIGH.
Otherwise, the SD may not work properly due to wrong polarity inversion
specified in DT after switch to common parsing function mmc_of_parse().

Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Acked-by: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

dts: imx25: fix sd card gpio polarity specified in device tree

cd-gpios polarity should be changed to GPIO_ACTIVE_LOW and wp-gpios
should be changed to GPIO_ACTIVE_HIGH.
Otherwise, the SD may not work properly due to wrong polarity inversion
specified in DT after switch to common parsing function mmc_of_parse().

Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Acked-by: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

dts: imx6: fix sd card gpio polarity specified in device tree

cd-gpios polarity should be changed to GPIO_ACTIVE_LOW and wp-gpios
should be changed to GPIO_ACTIVE_HIGH.
Otherwise, the SD may not work properly due to wrong polarity inversion
specified in DT after switch to common parsing function mmc_of_parse().

Signed-off-by: Dong Aisheng <aisheng.dong@freescale.com>
Acked-by: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>