openwrt/staging/blogic.git
4 years agodm bio record: save/restore bi_end_io and bi_integrity
Mike Snitzer [Fri, 28 Feb 2020 23:00:53 +0000 (18:00 -0500)]
dm bio record: save/restore bi_end_io and bi_integrity

Also, save/restore __bi_remaining in case the bio was used in a
BIO_CHAIN (e.g. due to blk_queue_split).

Suggested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm zoned: Fix reference counter initial value of chunk works
Shin'ichiro Kawasaki [Thu, 27 Feb 2020 00:18:52 +0000 (09:18 +0900)]
dm zoned: Fix reference counter initial value of chunk works

Dm-zoned initializes reference counters of new chunk works with zero
value and refcount_inc() is called to increment the counter. However, the
refcount_inc() function handles the addition to zero value as an error
and triggers the warning as follows:

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 7 PID: 1506 at lib/refcount.c:25 refcount_warn_saturate+0x68/0xf0
...
CPU: 7 PID: 1506 Comm: systemd-udevd Not tainted 5.4.0+ #134
...
Call Trace:
 dmz_map+0x2d2/0x350 [dm_zoned]
 __map_bio+0x42/0x1a0
 __split_and_process_non_flush+0x14a/0x1b0
 __split_and_process_bio+0x83/0x240
 ? kmem_cache_alloc+0x165/0x220
 dm_process_bio+0x90/0x230
 ? generic_make_request_checks+0x2e7/0x680
 dm_make_request+0x3e/0xb0
 generic_make_request+0xcf/0x320
 ? memcg_drain_all_list_lrus+0x1c0/0x1c0
 submit_bio+0x3c/0x160
 ? guard_bio_eod+0x2c/0x130
 mpage_readpages+0x182/0x1d0
 ? bdev_evict_inode+0xf0/0xf0
 read_pages+0x6b/0x1b0
 __do_page_cache_readahead+0x1ba/0x1d0
 force_page_cache_readahead+0x93/0x100
 generic_file_read_iter+0x83a/0xe40
 ? __seccomp_filter+0x7b/0x670
 new_sync_read+0x12a/0x1c0
 vfs_read+0x9d/0x150
 ksys_read+0x5f/0xe0
 do_syscall_64+0x5b/0x180
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
...

After this warning, following refcount API calls for the counter all fail
to change the counter value.

Fix this by setting the initial reference counter value not zero but one
for the new chunk works. Instead, do not call refcount_inc() via
dmz_get_chunk_work() for the new chunks works.

The failure was observed with linux version 5.4 with CONFIG_REFCOUNT_FULL
enabled. Refcount rework was merged to linux version 5.5 by the
commit 168829ad09ca ("Merge branch 'locking-core-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip"). After this
commit, CONFIG_REFCOUNT_FULL was removed and the failure was observed
regardless of kernel configuration.

Linux version 4.20 merged the commit 092b5648760a ("dm zoned: target: use
refcount_t for dm zoned reference counters"). Before this commit, dm
zoned used atomic_t APIs which does not check addition to zero, then this
fix is not necessary.

Fixes: 092b5648760a ("dm zoned: target: use refcount_t for dm zoned reference counters")
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm writecache: verify watermark during resume
Mikulas Patocka [Mon, 24 Feb 2020 09:20:30 +0000 (10:20 +0100)]
dm writecache: verify watermark during resume

Verify the watermark upon resume - so that if the target is reloaded
with lower watermark, it will start the cleanup process immediately.

Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm: report suspended device during destroy
Mikulas Patocka [Mon, 24 Feb 2020 09:20:28 +0000 (10:20 +0100)]
dm: report suspended device during destroy

The function dm_suspended returns true if the target is suspended.
However, when the target is being suspended during unload, it returns
false.

An example where this is a problem: the test "!dm_suspended(wc->ti)" in
writecache_writeback is not sufficient, because dm_suspended returns
zero while writecache_suspend is in progress.  As is, without an
enhanced dm_suspended, simply switching from flush_workqueue to
drain_workqueue still emits warnings:
workqueue writecache-writeback: drain_workqueue() isn't complete after 10 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 100 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 200 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 300 tries
workqueue writecache-writeback: drain_workqueue() isn't complete after 400 tries

writecache_suspend calls flush_workqueue(wc->writeback_wq) - this function
flushes the current work. However, the workqueue may re-queue itself and
flush_workqueue doesn't wait for re-queued works to finish. Because of
this - the function writecache_writeback continues execution after the
device was suspended and then concurrently with writecache_dtr, causing
a crash in writecache_writeback.

We must use drain_workqueue - that waits until the work and all re-queued
works finish.

As a prereq for switching to drain_workqueue, this commit fixes
dm_suspended to return true after the presuspend hook and before the
postsuspend hook - just like during a normal suspend. It allows
simplifying the dm-integrity and dm-writecache targets so that they
don't have to maintain suspended flags on their own.

With this change use of drain_workqueue() can be used effectively.  This
change was tested with the lvm2 testsuite and cryptsetup testsuite and
the are no regressions.

Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Reported-by: Corey Marthaler <cmarthal@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm thin metadata: fix lockdep complaint
Theodore Ts'o [Sun, 23 Feb 2020 19:54:58 +0000 (14:54 -0500)]
dm thin metadata: fix lockdep complaint

[ 3934.173244] ======================================================
[ 3934.179572] WARNING: possible circular locking dependency detected
[ 3934.185884] 5.4.21-xfstests #1 Not tainted
[ 3934.190151] ------------------------------------------------------
[ 3934.196673] dmsetup/8897 is trying to acquire lock:
[ 3934.201688] ffffffffbce82b18 (shrinker_rwsem){++++}, at: unregister_shrinker+0x22/0x80
[ 3934.210268]
               but task is already holding lock:
[ 3934.216489] ffff92a10cc5e1d0 (&pmd->root_lock){++++}, at: dm_pool_metadata_close+0xba/0x120
[ 3934.225083]
               which lock already depends on the new lock.

[ 3934.564165] Chain exists of:
                 shrinker_rwsem --> &journal->j_checkpoint_mutex --> &pmd->root_lock

For a more detailed lockdep report, please see:

https://lore.kernel.org/r/20200220234519.GA620489@mit.edu

We shouldn't need to hold the lock while are just tearing down and
freeing the whole metadata pool structure.

Fixes: 44d8ebf436399a4 ("dm thin metadata: use pool locking at end of dm_pool_metadata_close")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm cache: fix a crash due to incorrect work item cancelling
Mikulas Patocka [Wed, 19 Feb 2020 15:25:45 +0000 (10:25 -0500)]
dm cache: fix a crash due to incorrect work item cancelling

The crash can be reproduced by running the lvm2 testsuite test
lvconvert-thin-external-cache.sh for several minutes, e.g.:
  while :; do make check T=shell/lvconvert-thin-external-cache.sh; done

The crash happens in this call chain:
do_waker -> policy_tick -> smq_tick -> end_hotspot_period -> clear_bitset
-> memset -> __memset -- which accesses an invalid pointer in the vmalloc
area.

The work entry on the workqueue is executed even after the bitmap was
freed. The problem is that cancel_delayed_work doesn't wait for the
running work item to finish, so the work item can continue running and
re-submitting itself even after cache_postsuspend. In order to make sure
that the work item won't be running, we must use cancel_delayed_work_sync.

Also, change flush_workqueue to drain_workqueue, so that if some work item
submits itself or another work item, we are properly waiting for both of
them.

Fixes: c6b4fcbad044 ("dm: add cache target")
Cc: stable@vger.kernel.org # v3.9
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm integrity: fix invalid table returned due to argument count mismatch
Mikulas Patocka [Mon, 17 Feb 2020 13:11:35 +0000 (08:11 -0500)]
dm integrity: fix invalid table returned due to argument count mismatch

If the flag SB_FLAG_RECALCULATE is present in the superblock, but it was
not specified on the command line (i.e. ic->recalculate_flag is false),
dm-integrity would return invalid table line - the reported number of
arguments would not match the real number.

Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Reported-by: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm integrity: fix a deadlock due to offloading to an incorrect workqueue
Mikulas Patocka [Mon, 17 Feb 2020 12:43:03 +0000 (07:43 -0500)]
dm integrity: fix a deadlock due to offloading to an incorrect workqueue

If we need to perform synchronous I/O in dm_integrity_map_continue(),
we must make sure that we are not in the map function - in order to
avoid the deadlock due to bio queuing in generic_make_request. To
avoid the deadlock, we offload the request to metadata_wq.

However, metadata_wq also processes metadata updates for write requests.
If there are too many requests that get offloaded to metadata_wq at the
beginning of dm_integrity_map_continue, the workqueue metadata_wq
becomes clogged and the system is incapable of processing any metadata
updates.

This causes a deadlock because all the requests that need to do metadata
updates wait for metadata_wq to proceed and metadata_wq waits inside
wait_and_add_new_range until some existing request releases its range
lock (which doesn't happen because the range lock is released after
metadata update).

In order to fix the deadlock, we create a new workqueue offload_wq and
offload requests to it - so that processing of offload_wq is independent
from processing of metadata_wq.

Fixes: 7eada909bfd7 ("dm: add integrity target")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm integrity: fix recalculation when moving from journal mode to bitmap mode
Mikulas Patocka [Fri, 7 Feb 2020 16:42:30 +0000 (11:42 -0500)]
dm integrity: fix recalculation when moving from journal mode to bitmap mode

If we resume a device in bitmap mode and the on-disk format is in journal
mode, we must recalculate anything above ic->sb->recalc_sector. Otherwise,
there would be non-recalculated blocks which would cause I/O errors.

Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm: fix potential for q->make_request_fn NULL pointer
Mike Snitzer [Mon, 27 Jan 2020 19:07:23 +0000 (14:07 -0500)]
dm: fix potential for q->make_request_fn NULL pointer

Move blk_queue_make_request() to dm.c:alloc_dev() so that
q->make_request_fn is never NULL during the lifetime of a DM device
(even one that is created without a DM table).

Otherwise generic_make_request() will crash simply by doing:
  dmsetup create -n test
  mount /dev/dm-N /mnt

While at it, move ->congested_data initialization out of
dm.c:alloc_dev() and into the bio-based specific init method.

Reported-by: Stefan Bader <stefan.bader@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1860231
Fixes: ff36ab34583a ("dm: remove request-based logic from make_request_fn wrapper")
Depends-on: c12c9a3c3860c ("dm: various cleanups to md->queue initialization code")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm writecache: improve performance of large linear writes on SSDs
Mikulas Patocka [Wed, 15 Jan 2020 09:35:22 +0000 (04:35 -0500)]
dm writecache: improve performance of large linear writes on SSDs

When dm-writecache is used with SSD as a cache device, it would submit a
separate bio for each written block. The I/Os would be merged by the disk
scheduler, but this merging degrades performance.

Improve dm-writecache performance by submitting larger bios - this is
possible as long as there is consecutive free space on the cache
device.

Benchmark (arm64 with 64k page size, using /dev/ram0 as a cache device):

fio --bs=512k --iodepth=32 --size=400M --direct=1 \
    --filename=/dev/mapper/cache --rw=randwrite --numjobs=1 --name=test

block old new
size MiB/s MiB/s
---------------------
512 181 700
1k 347 1256
2k 644 2020
4k 1183 2759
8k 1852 3333
16k 2469 3509
32k 2974 3670
64k 3404 3810

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm mpath: Add timeout mechanism for queue_if_no_path
Anatol Pomazau [Mon, 13 Jan 2020 22:41:27 +0000 (17:41 -0500)]
dm mpath: Add timeout mechanism for queue_if_no_path

Add a configurable timeout mechanism to disable queue_if_no_path without
assistance from userspace multipathd.  This reimplements multipathd's
no_path_retry mechanism in kernel space.  This is motivated by the
desire to prevent processes from hanging indefinitely waiting for IO
in cases where multipathd might be unable to respond (after a failure
or for whatever reason).

Despite replicating userspace multipathd's policy configuration in
kernel space, it is important to prevent IOs from hanging forever,
waiting for userspace that may be incapable of behaving correctly.

Use of the provided "queue_if_no_path_timeout_secs" dm-multipath
module parameter is optional.  This timeout mechanism is disabled by
default (by being set to 0).

Signed-off-by: Anatol Pomazau <anatol@google.com>
Co-developed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm thin: change data device's flush_bio to be member of struct pool
Mikulas Patocka [Mon, 13 Jan 2020 20:41:53 +0000 (15:41 -0500)]
dm thin: change data device's flush_bio to be member of struct pool

With commit fe64369163c5 ("dm thin: don't allow changing data device
during thin-pool load") it is now possible to re-parent the data
device's flush_bio from the pool_c to pool structure.  Doing so offers
improved lifetime guarantees for the flush_bio so that the call to
dm_pool_register_pre_commit_callback can now be done safely from
pool_ctr().

Depends-on: fe64369163c5 ("dm thin: don't allow changing data device during thin-pool load")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm thin: don't allow changing data device during thin-pool reload
Mikulas Patocka [Mon, 13 Jan 2020 20:04:37 +0000 (15:04 -0500)]
dm thin: don't allow changing data device during thin-pool reload

The existing code allows changing the data device when the thin-pool
target is reloaded.

This capability is not required and only complicates device lifetime
guarantees. This can cause crashes like the one reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=1788596
where the kernel tries to issue a flush bio located in a structure that
was already freed.

Take the first step to simplifying the thin-pool's data device lifetime
by disallowing changing it. Like the thin-pool's metadata device, the
data device is now set in pool_create() and it cannot be changed for a
given thin-pool.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm thin: fix use-after-free in metadata_pre_commit_callback
Mike Snitzer [Mon, 13 Jan 2020 17:29:04 +0000 (12:29 -0500)]
dm thin: fix use-after-free in metadata_pre_commit_callback

dm-thin uses struct pool to hold the state of the pool. There may be
multiple pool_c's pointing to a given pool, each pool_c represents a
loaded target. pool_c's may be created and destroyed arbitrarily and the
pool contains a reference count of pool_c's pointing to it.

Since commit 694cfe7f31db3 ("dm thin: Flush data device before
committing metadata") a pointer to pool_c is passed to
dm_pool_register_pre_commit_callback and this function stores it in
pmd->pre_commit_context. If this pool_c is freed, but pool is not
(because there is another pool_c referencing it), we end up in a
situation where pmd->pre_commit_context structure points to freed
pool_c. It causes a crash in metadata_pre_commit_callback.

Fix this by moving the dm_pool_register_pre_commit_callback() from
pool_ctr() to pool_preresume(). This way the in-core thin-pool metadata
is only ever armed with callback data whose lifetime matches the
active thin-pool target.

In should be noted that this fix preserves the ability to load a
thin-pool table that uses a different data block device (that contains
the same data) -- though it is unclear if that capability is still
useful and/or needed.

Fixes: 694cfe7f31db3 ("dm thin: Flush data device before committing metadata")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm thin metadata: use pool locking at end of dm_pool_metadata_close
Mike Snitzer [Mon, 13 Jan 2020 16:18:51 +0000 (11:18 -0500)]
dm thin metadata: use pool locking at end of dm_pool_metadata_close

Ensure that the pool is locked during calls to __commit_transaction and
__destroy_persistent_data_objects.  Just being consistent with locking,
but reality is dm_pool_metadata_close is called once pool is being
destroyed so access to pool shouldn't be contended.

Also, use pmd_write_lock_in_core rather than __pmd_write_lock in
dm_pool_commit_metadata and rename __pmd_write_lock to
pmd_write_lock_in_core -- there was no need for the alias.

In addition, verify that the pool is locked in __commit_transaction().

Fixes: 873f258becca ("dm thin metadata: do not write metadata if no changes occurred")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm writecache: fix incorrect flush sequence when doing SSD mode commit
Mikulas Patocka [Wed, 8 Jan 2020 15:46:05 +0000 (10:46 -0500)]
dm writecache: fix incorrect flush sequence when doing SSD mode commit

When committing state, the function writecache_flush does the following:
1. write metadata (writecache_commit_flushed)
2. flush disk cache (writecache_commit_flushed)
3. wait for data writes to complete (writecache_wait_for_ios)
4. increase superblock seq_count
5. write the superblock
6. flush disk cache

It may happen that at step 3, when we wait for some write to finish, the
disk may report the write as finished, but the write only hit the disk
cache and it is not yet stored in persistent storage. At step 5 we write
the superblock - it may happen that the superblock is written before the
write that we waited for in step 3. If the machine crashes, it may result
in incorrect data being returned after reboot.

In order to fix the bug, we must swap steps 2 and 3 in the above sequence,
so that we first wait for writes to complete and then flush the disk
cache.

Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm crypt: fix benbi IV constructor crash if used in authenticated mode
Milan Broz [Mon, 6 Jan 2020 09:11:47 +0000 (10:11 +0100)]
dm crypt: fix benbi IV constructor crash if used in authenticated mode

If benbi IV is used in AEAD construction, for example:
  cryptsetup luksFormat <device> --cipher twofish-xts-benbi --key-size 512 --integrity=hmac-sha256
the constructor uses wrong skcipher function and crashes:

 BUG: kernel NULL pointer dereference, address: 00000014
 ...
 EIP: crypt_iv_benbi_ctr+0x15/0x70 [dm_crypt]
 Call Trace:
  ? crypt_subkey_size+0x20/0x20 [dm_crypt]
  crypt_ctr+0x567/0xfc0 [dm_crypt]
  dm_table_add_target+0x15f/0x340 [dm_mod]

Fix this by properly using crypt_aead_blocksize() in this case.

Fixes: ef43aa38063a6 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
Cc: stable@vger.kernel.org # v4.12+
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941051
Reported-by: Jerad Simpson <jbsimpson@gmail.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm crypt: Implement Elephant diffuser for Bitlocker compatibility
Milan Broz [Fri, 3 Jan 2020 08:20:22 +0000 (09:20 +0100)]
dm crypt: Implement Elephant diffuser for Bitlocker compatibility

Add experimental support for BitLocker encryption with CBC mode and
additional Elephant diffuser.

The mode was used in older Windows systems and it is provided mainly
for compatibility reasons. The userspace support to activate these
devices is being added to cryptsetup utility.

Read-write activation of such a device is very simple, for example:
  echo <password> | cryptsetup bitlkOpen bitlk_image.img test

The Elephant diffuser uses two rotations in opposite direction for
data (Diffuser A and B) and also XOR operation with Sector key over
the sector data; Sector key is derived from additional key data. The
original public documentation is available here:
  http://download.microsoft.com/download/0/2/3/0238acaf-d3bf-4a6d-b3d6-0a0be4bbb36e/bitlockercipher200608.pdf

The dm-crypt implementation is embedded to "elephant" IV (similar to
tcw IV construction).

Because we cannot modify original bio data for write (before
encryption), an additional internal flag to pre-process data is
added.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm space map common: fix to ensure new block isn't already in use
Joe Thornber [Tue, 7 Jan 2020 11:58:42 +0000 (11:58 +0000)]
dm space map common: fix to ensure new block isn't already in use

The space-maps track the reference counts for disk blocks allocated by
both the thin-provisioning and cache targets.  There are variants for
tracking metadata blocks and data blocks.

Transactionality is implemented by never touching blocks from the
previous transaction, so we can rollback in the event of a crash.

When allocating a new block we need to ensure the block is free (has
reference count of 0) in both the current and previous transaction.
Prior to this fix we were doing this by searching for a free block in
the previous transaction, and relying on a 'begin' counter to track
where the last allocation in the current transaction was.  This
'begin' field was not being updated in all code paths (eg, increment
of a data block reference count due to breaking sharing of a neighbour
block in the same btree leaf).

This fix keeps the 'begin' field, but now it's just a hint to speed up
the search.  Instead the current transaction is searched for a free
block, and then the old transaction is double checked to ensure it's
free.  Much simpler.

This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when
DM thin-provisioning's snapshots are heavily used.

Reported-by: Eric Wheeler <dm-devel@lists.ewheeler.net>
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm verity: don't prefetch hash blocks for already-verified data
xianrong.zhou [Tue, 7 Jan 2020 04:34:24 +0000 (20:34 -0800)]
dm verity: don't prefetch hash blocks for already-verified data

Try to skip prefetching hash blocks that won't be needed due to the
"check_at_most_once" option being enabled and the corresponding data
blocks already having been verified.

Since prefetching operates on a range of data blocks, do this by just
trimming the two ends of the range.  This doesn't skip every unneeded
hash block, since data blocks in the middle of the range could also be
unneeded, and hash blocks are still prefetched in large clusters as
controlled by dm_verity_prefetch_cluster.  But it can still help a lot.

In a test on Android Q launching 91 apps every 15s repeated 21 times,
prefetching was only done for 447177/4776629 = 9.36% of data blocks.

Tested-by: ruxian.feng <ruxian.feng@transsion.com>
Co-developed-by: yuanjiong.gao <yuanjiong.gao@transsion.com>
Signed-off-by: yuanjiong.gao <yuanjiong.gao@transsion.com>
Signed-off-by: xianrong.zhou <xianrong.zhou@transsion.com>
[EB: simplified the 'while' loops and improved the commit message]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm crypt: fix GFP flags passed to skcipher_request_alloc()
Mikulas Patocka [Thu, 2 Jan 2020 13:23:32 +0000 (08:23 -0500)]
dm crypt: fix GFP flags passed to skcipher_request_alloc()

GFP_KERNEL is not supposed to be or'd with GFP_NOFS (the result is
equivalent to GFP_KERNEL). Also, we use GFP_NOIO instead of GFP_NOFS
because we don't want any I/O being submitted in the direct reclaim
path.

Fixes: 39d13a1ac41d ("dm crypt: reuse eboiv skcipher for IV generation")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm thin metadata: Fix trivial math error in on-disk format documentation
Jeffle Xu [Mon, 30 Dec 2019 02:54:32 +0000 (10:54 +0800)]
dm thin metadata: Fix trivial math error in on-disk format documentation

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm thin metadata: use true/false for bool variable
zhengbin [Tue, 24 Dec 2019 06:38:03 +0000 (14:38 +0800)]
dm thin metadata: use true/false for bool variable

Fixes coccicheck warning:

drivers/md/dm-thin-metadata.c:814:3-14: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1109:1-12: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1621:1-12: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1652:1-12: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-thin-metadata.c:1706:1-12: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm snapshot: use true/false for bool variable
zhengbin [Tue, 24 Dec 2019 06:38:02 +0000 (14:38 +0800)]
dm snapshot: use true/false for bool variable

Fixes coccicheck warning:

drivers/md/dm-snap.c:1064:3-18: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-snap.c:1152:1-16: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-snap.c:1317:1-16: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm bio prison v2: use true/false for bool variable
zhengbin [Tue, 24 Dec 2019 06:38:01 +0000 (14:38 +0800)]
dm bio prison v2: use true/false for bool variable

Fixes coccicheck warning:

drivers/md/dm-bio-prison-v2.c:327:2-22: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm mpath: use true/false for bool variable
zhengbin [Tue, 24 Dec 2019 06:38:00 +0000 (14:38 +0800)]
dm mpath: use true/false for bool variable

Fixes coccicheck warning:

drivers/md/dm-mpath.c:1447:2-13: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm zoned: support zone sizes smaller than 128MiB
Dmitry Fomichev [Tue, 24 Dec 2019 01:05:46 +0000 (17:05 -0800)]
dm zoned: support zone sizes smaller than 128MiB

dm-zoned is observed to log failed kernel assertions and not work
correctly when operating against a device with a zone size smaller
than 128MiB (e.g. 32768 bits per 4K block). The reason is that the
bitmap size per zone is calculated as zero with such a small zone
size. Fix this problem and also make the code related to zone bitmap
management be able to handle per zone bitmaps smaller than a single
block.

A dm-zoned-tools patch is required to properly format dm-zoned devices
with zone sizes smaller than 128MiB.

Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm raid: table line rebuild status fixes
Heinz Mauelshagen [Thu, 19 Dec 2019 16:58:46 +0000 (17:58 +0100)]
dm raid: table line rebuild status fixes

raid_status() wasn't emitting rebuild flags on the table line properly
because the rdev number was not yet set properly; index raid component
devices array directly to solve.

Also fix wrong argument count on emitted table line caused by 1 too
many rebuild/write_mostly argument and consider any journal_(dev|mode)
pairs.

Link: https://bugzilla.redhat.com/1782045
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agodm dust: change ret to r in dust_map_write
Bryan Gurney [Thu, 5 Dec 2019 22:08:37 +0000 (17:08 -0500)]
dm dust: change ret to r in dust_map_write

In the dust_map_write() function, change the return code variable
"ret" to "r", to match the convention of the other device-mapper
targets.

Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
4 years agoLinux 5.5-rc5
Linus Torvalds [Sun, 5 Jan 2020 22:23:27 +0000 (14:23 -0800)]
Linux 5.5-rc5

4 years agoMerge tag 'riscv/for-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv...
Linus Torvalds [Sun, 5 Jan 2020 19:15:31 +0000 (11:15 -0800)]
Merge tag 'riscv/for-v5.5-rc5' of git://git./linux/kernel/git/riscv/linux

Pull RISC-V fixes from Paul Walmsley:
 "Several fixes for RISC-V:

   - Fix function graph trace support

   - Prefix the CSR IRQ_* macro names with "RV_", to avoid collisions
     with macros elsewhere in the Linux kernel tree named "IRQ_TIMER"

   - Use __pa_symbol() when computing the physical address of a kernel
     symbol, rather than __pa()

   - Mark the RISC-V port as supporting GCOV

  One DT addition:

   - Describe the L2 cache controller in the FU540 DT file

  One documentation update:

   - Add patch acceptance guideline documentation"

* tag 'riscv/for-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  Documentation: riscv: add patch acceptance guidelines
  riscv: prefix IRQ_ macro names with an RV_ namespace
  clocksource: riscv: add notrace to riscv_sched_clock
  riscv: ftrace: correct the condition logic in function graph tracer
  riscv: dts: Add DT support for SiFive L2 cache controller
  riscv: gcov: enable gcov for RISC-V
  riscv: mm: use __pa_symbol for kernel symbols

4 years agoDocumentation: riscv: add patch acceptance guidelines
Paul Walmsley [Sat, 23 Nov 2019 02:33:28 +0000 (18:33 -0800)]
Documentation: riscv: add patch acceptance guidelines

Formalize, in kernel documentation, the patch acceptance policy for
arch/riscv.  In summary, it states that as maintainers, we plan to
only accept patches for new modules or extensions that have been
frozen or ratified by the RISC-V Foundation.

We've been following these guidelines for the past few months.  In the
meantime, we've received quite a bit of feedback that it would be
helpful to have these guidelines formally documented.

Based on a suggestion from Matthew Wilcox, we also add a link to this
file to Documentation/process/index.rst, to make this document easier
to find.  The format of this document has also been changed to align
to the format outlined in the maintainer entry profiles, in accordance
with comments from Jon Corbet and Dan Williams.

Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Krste Asanovic <krste@berkeley.edu>
Cc: Andrew Waterman <waterman@eecs.berkeley.edu>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
4 years agoriscv: prefix IRQ_ macro names with an RV_ namespace
Paul Walmsley [Fri, 20 Dec 2019 11:09:49 +0000 (03:09 -0800)]
riscv: prefix IRQ_ macro names with an RV_ namespace

"IRQ_TIMER", used in the arch/riscv CSR header file, is a sufficiently
generic macro name that it's used by several source files across the
Linux code base.  Some of these other files ultimately include the
arch/riscv CSR include file, causing collisions.  Fix by prefixing the
RISC-V csr.h IRQ_ macro names with an RV_ prefix.

Fixes: a4c3733d32a72 ("riscv: abstract out CSR names for supervisor vs machine mode")
Reported-by: Olof Johansson <olof@lixom.net>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
4 years agoclocksource: riscv: add notrace to riscv_sched_clock
Zong Li [Mon, 23 Dec 2019 08:46:14 +0000 (16:46 +0800)]
clocksource: riscv: add notrace to riscv_sched_clock

When enabling ftrace graph tracer, it gets the tracing clock in
ftrace_push_return_trace().  Eventually, it invokes riscv_sched_clock()
to get the clock value.  If riscv_sched_clock() isn't marked with
'notrace', it will call ftrace_push_return_trace() and cause infinite
loop.

The result of failure as follow:

command: echo function_graph >current_tracer
[   46.176787] Unable to handle kernel paging request at virtual address ffffffe04fb38c48
[   46.177309] Oops [#1]
[   46.177478] Modules linked in:
[   46.177770] CPU: 0 PID: 256 Comm: $d Not tainted 5.5.0-rc1 #47
[   46.177981] epc: ffffffe00035e59a ra : ffffffe00035e57e sp : ffffffe03a7569b0
[   46.178216]  gp : ffffffe000d29b90 tp : ffffffe03a756180 t0 : ffffffe03a756968
[   46.178430]  t1 : ffffffe00087f408 t2 : ffffffe03a7569a0 s0 : ffffffe03a7569f0
[   46.178643]  s1 : ffffffe00087f408 a0 : 0000000ac054cda4 a1 : 000000000087f411
[   46.178856]  a2 : 0000000ac054cda4 a3 : 0000000000373ca0 a4 : ffffffe04fb38c48
[   46.179099]  a5 : 00000000153e22a8 a6 : 00000000005522ff a7 : 0000000000000005
[   46.179338]  s2 : ffffffe03a756a90 s3 : ffffffe00032811c s4 : ffffffe03a756a58
[   46.179570]  s5 : ffffffe000d29fe0 s6 : 0000000000000001 s7 : 0000000000000003
[   46.179809]  s8 : 0000000000000003 s9 : 0000000000000002 s10: 0000000000000004
[   46.180053]  s11: 0000000000000000 t3 : 0000003fc815749c t4 : 00000000000efc90
[   46.180293]  t5 : ffffffe000d29658 t6 : 0000000000040000
[   46.180482] status: 0000000000000100 badaddr: ffffffe04fb38c48 cause: 000000000000000f

Signed-off-by: Zong Li <zong.li@sifive.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[paul.walmsley@sifive.com: cleaned up patch description]
Fixes: 92e0d143fdef ("clocksource/drivers/riscv_timer: Provide the sched_clock")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
4 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sun, 5 Jan 2020 03:38:51 +0000 (19:38 -0800)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "17 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  hexagon: define ioremap_uc
  ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less
  ocfs2: call journal flush to mark journal as empty after journal recovery when mount
  mm/hugetlb: defer freeing of huge pages if in non-task context
  mm/gup: fix memory leak in __gup_benchmark_ioctl
  mm/oom: fix pgtables units mismatch in Killed process message
  fs/posix_acl.c: fix kernel-doc warnings
  hexagon: work around compiler crash
  hexagon: parenthesize registers in asm predicates
  fs/namespace.c: make to_mnt_ns() static
  fs/nsfs.c: include headers for missing declarations
  fs/direct-io.c: include fs/internal.h for missing prototype
  mm: move_pages: return valid node id in status if the page is already on the target node
  memcg: account security cred as well to kmemcg
  kcov: fix struct layout for kcov_remote_arg
  mm/zsmalloc.c: fix the migrated zspage statistics.
  mm/memory_hotplug: shrink zones when offlining memory

4 years agoMerge tag 'apparmor-pr-2020-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 5 Jan 2020 03:28:30 +0000 (19:28 -0800)]
Merge tag 'apparmor-pr-2020-01-04' of git://git./linux/kernel/git/jj/linux-apparmor

Pull apparmor fixes from John Johansen:

 - performance regression: only get a label reference if the fast path
   check fails

 - fix aa_xattrs_match() may sleep while holding a RCU lock

 - fix bind mounts aborting with -ENOMEM

* tag 'apparmor-pr-2020-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
  apparmor: fix aa_xattrs_match() may sleep while holding a RCU lock
  apparmor: only get a label reference if the fast path check fails
  apparmor: fix bind mounts aborting with -ENOMEM

4 years agoapparmor: fix aa_xattrs_match() may sleep while holding a RCU lock
John Johansen [Thu, 2 Jan 2020 13:31:22 +0000 (05:31 -0800)]
apparmor: fix aa_xattrs_match() may sleep while holding a RCU lock

aa_xattrs_match() is unfortunately calling vfs_getxattr_alloc() from a
context protected by an rcu_read_lock. This can not be done as
vfs_getxattr_alloc() may sleep regardles of the gfp_t value being
passed to it.

Fix this by breaking the rcu_read_lock on the policy search when the
xattr match feature is requested and restarting the search if a policy
changes occur.

Fixes: 8e51f9087f40 ("apparmor: Add support for attaching profiles via xattr, presence and value")
Reported-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: John Johansen <john.johansen@canonical.com>
4 years agoMerge tag 'mips_fixes_5.5_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips...
Linus Torvalds [Sat, 4 Jan 2020 22:16:57 +0000 (14:16 -0800)]
Merge tag 'mips_fixes_5.5_1' of git://git./linux/kernel/git/mips/linux

Pull MIPS fixes from Paul Burton:
 "A collection of MIPS fixes:

   - Fill the struct cacheinfo shared_cpu_map field with sensible
     values, notably avoiding issues with perf which was unhappy in the
     absence of these values.

   - A boot fix for Loongson 2E & 2F machines which was fallout from
     some refactoring performed this cycle.

   - A Kconfig dependency fix for the Loongson CPU HWMon driver.

   - A couple of VDSO fixes, ensuring gettimeofday() behaves
     appropriately for kernel configurations that don't include support
     for a clocksource the VDSO can use & fixing the calling convention
     for the n32 & n64 VDSOs which would previously clobber the $gp/$28
     register.

   - A build fix for vmlinuz compressed images which were
     inappropriately building with -fsanitize-coverage despite not being
     part of the kernel proper, then failing to link due to the missing
     __sanitizer_cov_trace_pc() function.

   - A couple of eBPF JIT fixes, including disabling it for MIPS32 due
     to a large number of issues with the code generated there &
     reflecting ISA dependencies in Kconfig to enforce that systems
     which don't support the JIT must include the interpreter"

* tag 'mips_fixes_5.5_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MIPS: Avoid VDSO ABI breakage due to global register variable
  MIPS: BPF: eBPF JIT: check for MIPS ISA compliance in Kconfig
  MIPS: BPF: Disable MIPS32 eBPF JIT
  MIPS: Prevent link failure with kcov instrumentation
  MIPS: Kconfig: Use correct form for 'depends on'
  mips: Fix gettimeofday() in the vdso library
  MIPS: Fix boot on Fuloong2 systems
  mips: cacheinfo: report shared CPU map

4 years agohexagon: define ioremap_uc
Nick Desaulniers [Sat, 4 Jan 2020 21:00:26 +0000 (13:00 -0800)]
hexagon: define ioremap_uc

Similar to commit 38e45d81d14e ("sparc64: implement ioremap_uc") define
ioremap_uc for hexagon to avoid errors from
-Wimplicit-function-definition.

Link: http://lkml.kernel.org/r/20191209222956.239798-2-ndesaulniers@google.com
Link: https://github.com/ClangBuiltLinux/linux/issues/797
Fixes: e537654b7039 ("lib: devres: add a helper function for ioremap_uc")
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Brian Cain <bcain@codeaurora.org>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Tuowen Zhao <ztuowen@gmail.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexios Zavras <alexios.zavras@intel.com>
Cc: Allison Randal <allison@lohutok.net>
Cc: Will Deacon <will@kernel.org>
Cc: Richard Fontana <rfontana@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agoocfs2: fix the crash due to call ocfs2_get_dlm_debug once less
Gang He [Sat, 4 Jan 2020 21:00:22 +0000 (13:00 -0800)]
ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less

Because ocfs2_get_dlm_debug() function is called once less here, ocfs2
file system will trigger the system crash, usually after ocfs2 file
system is unmounted.

This system crash is caused by a generic memory corruption, these crash
backtraces are not always the same, for exapmle,

    ocfs2: Unmounting device (253,16) on (node 172167785)
    general protection fault: 0000 [#1] SMP PTI
    CPU: 3 PID: 14107 Comm: fence_legacy Kdump:
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
    RIP: 0010:__kmalloc+0xa5/0x2a0
    Code: 00 00 4d 8b 07 65 4d 8b
    RSP: 0018:ffffaa1fc094bbe8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: d310a8800d7a3faf RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000dc0 RDI: ffff96e68fc036c0
    RBP: d310a8800d7a3faf R08: ffff96e6ffdb10a0 R09: 00000000752e7079
    R10: 000000000001c513 R11: 0000000004091041 R12: 0000000000000dc0
    R13: 0000000000000039 R14: ffff96e68fc036c0 R15: ffff96e68fc036c0
    FS:  00007f699dfba540(0000) GS:ffff96e6ffd80000(0000) knlGS:00000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055f3a9d9b768 CR3: 000000002cd1c000 CR4: 00000000000006e0
    Call Trace:
     ext4_htree_store_dirent+0x35/0x100 [ext4]
     htree_dirblock_to_tree+0xea/0x290 [ext4]
     ext4_htree_fill_tree+0x1c1/0x2d0 [ext4]
     ext4_readdir+0x67c/0x9d0 [ext4]
     iterate_dir+0x8d/0x1a0
     __x64_sys_getdents+0xab/0x130
     do_syscall_64+0x60/0x1f0
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f699d33a9fb

This regression problem was introduced by commit e581595ea29c ("ocfs: no
need to check return value of debugfs_create functions").

Link: http://lkml.kernel.org/r/20191225061501.13587-1-ghe@suse.com
Fixes: e581595ea29c ("ocfs: no need to check return value of debugfs_create functions")
Signed-off-by: Gang He <ghe@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org> [5.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agoocfs2: call journal flush to mark journal as empty after journal recovery when mount
Kai Li [Sat, 4 Jan 2020 21:00:18 +0000 (13:00 -0800)]
ocfs2: call journal flush to mark journal as empty after journal recovery when mount

If journal is dirty when mount, it will be replayed but jbd2 sb log tail
cannot be updated to mark a new start because journal->j_flag has
already been set with JBD2_ABORT first in journal_init_common.

When a new transaction is committed, it will be recored in block 1
first(journal->j_tail is set to 1 in journal_reset).  If emergency
restart happens again before journal super block is updated
unfortunately, the new recorded trans will not be replayed in the next
mount.

The following steps describe this procedure in detail.
1. mount and touch some files
2. these transactions are committed to journal area but not checkpointed
3. emergency restart
4. mount again and its journals are replayed
5. journal super block's first s_start is 1, but its s_seq is not updated
6. touch a new file and its trans is committed but not checkpointed
7. emergency restart again
8. mount and journal is dirty, but trans committed in 6 will not be
replayed.

This exception happens easily when this lun is used by only one node.
If it is used by multi-nodes, other node will replay its journal and its
journal super block will be updated after recovery like what this patch
does.

ocfs2_recover_node->ocfs2_replay_journal.

The following jbd2 journal can be generated by touching a new file after
journal is replayed, and seq 15 is the first valid commit, but first seq
is 13 in journal super block.

logdump:
  Block 0: Journal Superblock
  Seq: 0   Type: 4 (JBD2_SUPERBLOCK_V2)
  Blocksize: 4096   Total Blocks: 32768   First Block: 1
  First Commit ID: 13   Start Log Blknum: 1
  Error: 0
  Feature Compat: 0
  Feature Incompat: 2 block64
  Feature RO compat: 0
  Journal UUID: 4ED3822C54294467A4F8E87D2BA4BC36
  FS Share Cnt: 1   Dynamic Superblk Blknum: 0
  Per Txn Block Limit    Journal: 0    Data: 0

  Block 1: Journal Commit Block
  Seq: 14   Type: 2 (JBD2_COMMIT_BLOCK)

  Block 2: Journal Descriptor
  Seq: 15   Type: 1 (JBD2_DESCRIPTOR_BLOCK)
  No. Blocknum        Flags
   0. 587             none
  UUID: 00000000000000000000000000000000
   1. 8257792         JBD2_FLAG_SAME_UUID
   2. 619             JBD2_FLAG_SAME_UUID
   3. 24772864        JBD2_FLAG_SAME_UUID
   4. 8257802         JBD2_FLAG_SAME_UUID
   5. 513             JBD2_FLAG_SAME_UUID JBD2_FLAG_LAST_TAG
  ...
  Block 7: Inode
  Inode: 8257802   Mode: 0640   Generation: 57157641 (0x3682809)
  FS Generation: 2839773110 (0xa9437fb6)
  CRC32: 00000000   ECC: 0000
  Type: Regular   Attr: 0x0   Flags: Valid
  Dynamic Features: (0x1) InlineData
  User: 0 (root)   Group: 0 (root)   Size: 7
  Links: 1   Clusters: 0
  ctime: 0x5de5d870 0x11104c61 -- Tue Dec  3 11:37:20.286280801 2019
  atime: 0x5de5d870 0x113181a1 -- Tue Dec  3 11:37:20.288457121 2019
  mtime: 0x5de5d870 0x11104c61 -- Tue Dec  3 11:37:20.286280801 2019
  dtime: 0x0 -- Thu Jan  1 08:00:00 1970
  ...
  Block 9: Journal Commit Block
  Seq: 15   Type: 2 (JBD2_COMMIT_BLOCK)

The following is journal recovery log when recovering the upper jbd2
journal when mount again.

syslog:
  ocfs2: File system on device (252,1) was not unmounted cleanly, recovering it.
  fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 0
  fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 1
  fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 2
  fs/jbd2/recovery.c:(jbd2_journal_recover, 278): JBD2: recovery, exit status 0, recovered transactions 13 to 13

Due to first commit seq 13 recorded in journal super is not consistent
with the value recorded in block 1(seq is 14), journal recovery will be
terminated before seq 15 even though it is an unbroken commit, inode
8257802 is a new file and it will be lost.

Link: http://lkml.kernel.org/r/20191217020140.2197-1-li.kai4@h3c.com
Signed-off-by: Kai Li <li.kai4@h3c.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Changwei Ge <gechangwei@live.cn>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/hugetlb: defer freeing of huge pages if in non-task context
Waiman Long [Sat, 4 Jan 2020 21:00:15 +0000 (13:00 -0800)]
mm/hugetlb: defer freeing of huge pages if in non-task context

The following lockdep splat was observed when a certain hugetlbfs test
was run:

  ================================
  WARNING: inconsistent lock state
  4.18.0-159.el8.x86_64+debug #1 Tainted: G        W --------- -  -
  --------------------------------
  inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
  swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
  ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
  {SOFTIRQ-ON-W} state was registered at:
    lock_acquire+0x14f/0x3b0
    _raw_spin_lock+0x30/0x70
    __nr_hugepages_store_common+0x11b/0xb30
    hugetlb_sysctl_handler_common+0x209/0x2d0
    proc_sys_call_handler+0x37f/0x450
    vfs_write+0x157/0x460
    ksys_write+0xb8/0x170
    do_syscall_64+0xa5/0x4d0
    entry_SYSCALL_64_after_hwframe+0x6a/0xdf
  irq event stamp: 691296
  hardirqs last  enabled at (691296): [<ffffffff99bb034b>] _raw_spin_unlock_irqrestore+0x4b/0x60
  hardirqs last disabled at (691295): [<ffffffff99bb0ad2>] _raw_spin_lock_irqsave+0x22/0x81
  softirqs last  enabled at (691284): [<ffffffff97ff0c63>] irq_enter+0xc3/0xe0
  softirqs last disabled at (691285): [<ffffffff97ff0ebe>] irq_exit+0x23e/0x2b0

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(hugetlb_lock);
    <Interrupt>
      lock(hugetlb_lock);

   *** DEADLOCK ***
      :
  Call Trace:
   <IRQ>
   __lock_acquire+0x146b/0x48c0
   lock_acquire+0x14f/0x3b0
   _raw_spin_lock+0x30/0x70
   free_huge_page+0x36f/0xaa0
   bio_check_pages_dirty+0x2fc/0x5c0
   clone_endio+0x17f/0x670 [dm_mod]
   blk_update_request+0x276/0xe50
   scsi_end_request+0x7b/0x6a0
   scsi_io_completion+0x1c6/0x1570
   blk_done_softirq+0x22e/0x350
   __do_softirq+0x23d/0xad8
   irq_exit+0x23e/0x2b0
   do_IRQ+0x11a/0x200
   common_interrupt+0xf/0xf
   </IRQ>

Both the hugetbl_lock and the subpool lock can be acquired in
free_huge_page().  One way to solve the problem is to make both locks
irq-safe.  However, Mike Kravetz had learned that the hugetlb_lock is
held for a linear scan of ALL hugetlb pages during a cgroup reparentling
operation.  So it is just too long to have irq disabled unless we can
break hugetbl_lock down into finer-grained locks with shorter lock hold
times.

Another alternative is to defer the freeing to a workqueue job.  This
patch implements the deferred freeing by adding a free_hpage_workfn()
work function to do the actual freeing.  The free_huge_page() call in a
non-task context saves the page to be freed in the hpage_freelist linked
list in a lockless manner using the llist APIs.

The generic workqueue is used to process the work, but a dedicated
workqueue can be used instead if it is desirable to have the huge page
freed ASAP.

Thanks to Kirill Tkhai <ktkhai@virtuozzo.com> for suggesting the use of
llist APIs which simplfy the code.

Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/gup: fix memory leak in __gup_benchmark_ioctl
Navid Emamdoost [Sat, 4 Jan 2020 21:00:12 +0000 (13:00 -0800)]
mm/gup: fix memory leak in __gup_benchmark_ioctl

In the implementation of __gup_benchmark_ioctl() the allocated pages
should be released before returning in case of an invalid cmd.  Release
pages via kvfree().

[akpm@linux-foundation.org: rework code flow, return -EINVAL rather than -1]
Link: http://lkml.kernel.org/r/20191211174653.4102-1-navid.emamdoost@gmail.com
Fixes: 714a3a1ebafe ("mm/gup_benchmark.c: add additional pinning methods")
Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/oom: fix pgtables units mismatch in Killed process message
Ilya Dryomov [Sat, 4 Jan 2020 21:00:09 +0000 (13:00 -0800)]
mm/oom: fix pgtables units mismatch in Killed process message

pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes.
As everything else is printed in kB, I chose to fix the value rather than
the string.

Before:

[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
...
[   1878]  1000  1878   217253   151144  1269760        0             0 python
...
Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0

After:

[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
...
[   1436]  1000  1436   217253   151890  1294336        0             0 python
...
Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0

Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com
Fixes: 70cb6d267790 ("mm/oom: add oom_score_adj and pgtables to Killed process message")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Edward Chron <echron@arista.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agofs/posix_acl.c: fix kernel-doc warnings
Randy Dunlap [Sat, 4 Jan 2020 21:00:05 +0000 (13:00 -0800)]
fs/posix_acl.c: fix kernel-doc warnings

Fix kernel-doc warnings in fs/posix_acl.c.
Also fix one typo (setgit -> setgid).

  fs/posix_acl.c:647: warning: Function parameter or member 'inode' not described in 'posix_acl_update_mode'
  fs/posix_acl.c:647: warning: Function parameter or member 'mode_p' not described in 'posix_acl_update_mode'
  fs/posix_acl.c:647: warning: Function parameter or member 'acl' not described in 'posix_acl_update_mode'

Link: http://lkml.kernel.org/r/29b0dc46-1f28-a4e5-b1d0-ba2b65629779@infradead.org
Fixes: 073931017b49d ("posix_acl: Clear SGID bit when setting file permissions")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agohexagon: work around compiler crash
Nick Desaulniers [Sat, 4 Jan 2020 21:00:02 +0000 (13:00 -0800)]
hexagon: work around compiler crash

Clang cannot translate the string "r30" into a valid register yet.

Link: https://github.com/ClangBuiltLinux/linux/issues/755
Link: http://lkml.kernel.org/r/20191028155722.23419-1-ndesaulniers@google.com
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Sid Manning <sidneym@quicinc.com>
Reviewed-by: Brian Cain <bcain@codeaurora.org>
Cc: Allison Randal <allison@lohutok.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Richard Fontana <rfontana@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agohexagon: parenthesize registers in asm predicates
Nick Desaulniers [Sat, 4 Jan 2020 20:59:59 +0000 (12:59 -0800)]
hexagon: parenthesize registers in asm predicates

Hexagon requires that register predicates in assembly be parenthesized.

Link: https://github.com/ClangBuiltLinux/linux/issues/754
Link: http://lkml.kernel.org/r/20191209222956.239798-3-ndesaulniers@google.com
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Sid Manning <sidneym@codeaurora.org>
Acked-by: Brian Cain <bcain@codeaurora.org>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Tuowen Zhao <ztuowen@gmail.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexios Zavras <alexios.zavras@intel.com>
Cc: Allison Randal <allison@lohutok.net>
Cc: Will Deacon <will@kernel.org>
Cc: Richard Fontana <rfontana@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agofs/namespace.c: make to_mnt_ns() static
Eric Biggers [Sat, 4 Jan 2020 20:59:55 +0000 (12:59 -0800)]
fs/namespace.c: make to_mnt_ns() static

Make to_mnt_ns() static to address the following 'sparse' warning:

    fs/namespace.c:1731:22: warning: symbol 'to_mnt_ns' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191209234830.156260-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agofs/nsfs.c: include headers for missing declarations
Eric Biggers [Sat, 4 Jan 2020 20:59:52 +0000 (12:59 -0800)]
fs/nsfs.c: include headers for missing declarations

Include linux/proc_fs.h and fs/internal.h to address the following
'sparse' warnings:

    fs/nsfs.c:41:32: warning: symbol 'ns_dentry_operations' was not declared. Should it be static?
    fs/nsfs.c:145:5: warning: symbol 'open_related_ns' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191209234822.156179-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agofs/direct-io.c: include fs/internal.h for missing prototype
Eric Biggers [Sat, 4 Jan 2020 20:59:49 +0000 (12:59 -0800)]
fs/direct-io.c: include fs/internal.h for missing prototype

Include fs/internal.h to address the following 'sparse' warning:

    fs/direct-io.c:591:5: warning: symbol 'sb_init_dio_done_wq' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191209234544.128302-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm: move_pages: return valid node id in status if the page is already on the target...
Yang Shi [Sat, 4 Jan 2020 20:59:46 +0000 (12:59 -0800)]
mm: move_pages: return valid node id in status if the page is already on the target node

Felix Abecassis reports move_pages() would return random status if the
pages are already on the target node by the below test program:

  int main(void)
  {
const long node_id = 1;
const long page_size = sysconf(_SC_PAGESIZE);
const int64_t num_pages = 8;

unsigned long nodemask =  1 << node_id;
long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
if (ret < 0)
return (EXIT_FAILURE);

void **pages = malloc(sizeof(void*) * num_pages);
for (int i = 0; i < num_pages; ++i) {
pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
-1, 0);
if (pages[i] == MAP_FAILED)
return (EXIT_FAILURE);
}

ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
if (ret < 0)
return (EXIT_FAILURE);

int *nodes = malloc(sizeof(int) * num_pages);
int *status = malloc(sizeof(int) * num_pages);
for (int i = 0; i < num_pages; ++i) {
nodes[i] = node_id;
status[i] = 0xd0; /* simulate garbage values */
}

ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
printf("move_pages: %ld\n", ret);
for (int i = 0; i < num_pages; ++i)
printf("status[%d] = %d\n", i, status[i]);
  }

Then running the program would return nonsense status values:

  $ ./move_pages_bug
  move_pages: 0
  status[0] = 208
  status[1] = 208
  status[2] = 208
  status[3] = 208
  status[4] = 208
  status[5] = 208
  status[6] = 208
  status[7] = 208

This is because the status is not set if the page is already on the
target node, but move_pages() should return valid status as long as it
succeeds.  The valid status may be errno or node id.

We can't simply initialize status array to zero since the pages may be
not on node 0.  Fix it by updating status with node id which the page is
already on.

Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Tested-by: Felix Abecassis <fabecassis@nvidia.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [4.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomemcg: account security cred as well to kmemcg
Shakeel Butt [Sat, 4 Jan 2020 20:59:43 +0000 (12:59 -0800)]
memcg: account security cred as well to kmemcg

The cred_jar kmem_cache is already memcg accounted in the current kernel
but cred->security is not.  Account cred->security to kmemcg.

Recently we saw high root slab usage on our production and on further
inspection, we found a buggy application leaking processes.  Though that
buggy application was contained within its memcg but we observe much
more system memory overhead, couple of GiBs, during that period.  This
overhead can adversely impact the isolation on the system.

One source of high overhead we found was cred->security objects, which
have a lifetime of at least the life of the process which allocated
them.

Link: http://lkml.kernel.org/r/20191205223721.40034-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agokcov: fix struct layout for kcov_remote_arg
Andrey Konovalov [Sat, 4 Jan 2020 20:59:39 +0000 (12:59 -0800)]
kcov: fix struct layout for kcov_remote_arg

Make the layout of kcov_remote_arg the same for 32-bit and 64-bit code.
This makes it more convenient to write userspace apps that can be
compiled into 32-bit or 64-bit binaries and still work with the same
64-bit kernel.

Also use proper __u32 types in uapi headers instead of unsigned ints.

Link: http://lkml.kernel.org/r/9e91020876029cfefc9211ff747685eba9536426.1575638983.git.andreyknvl@google.com
Fixes: eec028c9386ed1a ("kcov: remote coverage support")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Chunfeng Yun <chunfeng.yun@mediatek.com>
Cc: "Jacky . Cao @ sony . com" <Jacky.Cao@sony.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/zsmalloc.c: fix the migrated zspage statistics.
Chanho Min [Sat, 4 Jan 2020 20:59:36 +0000 (12:59 -0800)]
mm/zsmalloc.c: fix the migrated zspage statistics.

When zspage is migrated to the other zone, the zone page state should be
updated as well, otherwise the NR_ZSPAGE for each zone shows wrong
counts including proc/zoneinfo in practice.

Link: http://lkml.kernel.org/r/1575434841-48009-1-git-send-email-chanho.min@lge.com
Fixes: 91537fee0013 ("mm: add NR_ZSMALLOC to vmstat")
Signed-off-by: Chanho Min <chanho.min@lge.com>
Signed-off-by: Jinsuk Choi <jjinsuk.choi@lge.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org> [4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agomm/memory_hotplug: shrink zones when offlining memory
David Hildenbrand [Sat, 4 Jan 2020 20:59:33 +0000 (12:59 -0800)]
mm/memory_hotplug: shrink zones when offlining memory

We currently try to shrink a single zone when removing memory.  We use
the zone of the first page of the memory we are removing.  If that
memmap was never initialized (e.g., memory was never onlined), we will
read garbage and can trigger kernel BUGs (due to a stale pointer):

    BUG: unable to handle page fault for address: 000000000000353d
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 0 P4D 0
    Oops: 0002 [#1] SMP PTI
    CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
    Workqueue: kacpi_hotplug acpi_hotplug_work_fn
    RIP: 0010:clear_zone_contiguous+0x5/0x10
    Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
    RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
    RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
    RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
    R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
    FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     __remove_pages+0x4b/0x640
     arch_remove_memory+0x63/0x8d
     try_remove_memory+0xdb/0x130
     __remove_memory+0xa/0x11
     acpi_memory_device_remove+0x70/0x100
     acpi_bus_trim+0x55/0x90
     acpi_device_hotplug+0x227/0x3a0
     acpi_hotplug_work_fn+0x1a/0x30
     process_one_work+0x221/0x550
     worker_thread+0x50/0x3b0
     kthread+0x105/0x140
     ret_from_fork+0x3a/0x50
    Modules linked in:
    CR2: 000000000000353d

Instead, shrink the zones when offlining memory or when onlining failed.
Introduce and use remove_pfn_range_from_zone(() for that.  We now
properly shrink the zones, even if we have DIMMs whereby

 - Some memory blocks fall into no zone (never onlined)

 - Some memory blocks fall into multiple zones (offlined+re-onlined)

 - Multiple memory blocks that fall into different zones

Drop the zone parameter (with a potential dubious value) from
__remove_pages() and __remove_section().

Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agoMerge tag 'dmaengine-fix-5.5-rc5' of git://git.infradead.org/users/vkoul/slave-dma
Linus Torvalds [Sat, 4 Jan 2020 18:49:15 +0000 (10:49 -0800)]
Merge tag 'dmaengine-fix-5.5-rc5' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine fixes from Vinod Koul:
 "A bunch of fixes for:

   - uninitialized dma_slave_caps access

   - virt-dma use after free in vchan_complete()

   - driver fixes for ioat, k3dma and jz4780"

* tag 'dmaengine-fix-5.5-rc5' of git://git.infradead.org/users/vkoul/slave-dma:
  ioat: ioat_alloc_ring() failure handling.
  dmaengine: virt-dma: Fix access after free in vchan_complete()
  dmaengine: k3dma: Avoid null pointer traversal
  dmaengine: dma-jz4780: Also break descriptor chains on JZ4725B
  dmaengine: Fix access to uninitialized dma_slave_caps

4 years agoMerge tag 'media/v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Linus Torvalds [Sat, 4 Jan 2020 18:41:08 +0000 (10:41 -0800)]
Merge tag 'media/v5.5-3' of git://git./linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:

 - some fixes at CEC core to comply with HDMI 2.0 specs and fix some
   border cases

 - a fix at the transmission logic of the pulse8-cec driver

 - one alignment fix on a data struct at ipu3 when built with 32 bits

* tag 'media/v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: intel-ipu3: Align struct ipu3_uapi_awb_fr_config_s to 32 bytes
  media: pulse8-cec: fix lost cec_transmit_attempt_done() call
  media: cec: check 'transmit_in_progress', not 'transmitting'
  media: cec: avoid decrementing transmit_queue_sz if it is 0
  media: cec: CEC 2.0-only bcast messages were ignored

4 years agoMerge tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Fri, 3 Jan 2020 20:20:21 +0000 (12:20 -0800)]
Merge tag 'for-5.5-rc4-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few fixes for btrfs:

   - blkcg accounting problem with compression that could stall writes

   - setting up blkcg bio for compression crashes due to NULL bdev
     pointer

   - fix possible infinite loop in writeback for nocow files (here
     possible means almost impossible, 13 things that need to happen to
     trigger it)"

* tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix infinite loop during nocow writeback due to race
  btrfs: fix compressed write bio blkcg attribution
  btrfs: punt all bios created in btrfs_submit_compressed_write()

4 years agoMerge tag 'block-5.5-20200103' of git://git.kernel.dk/linux-block
Linus Torvalds [Fri, 3 Jan 2020 20:11:30 +0000 (12:11 -0800)]
Merge tag 'block-5.5-20200103' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Three fixes in here:

   - Fix for a missing split on default memory boundary mask (4G) (Ming)

   - Fix for multi-page read bio truncate (Ming)

   - Fix for null_blk zone close request handling (Damien)"

* tag 'block-5.5-20200103' of git://git.kernel.dk/linux-block:
  null_blk: Fix REQ_OP_ZONE_CLOSE handling
  block: fix splitting segments on boundary masks
  block: add bio_truncate to fix guard_bio_eod

4 years agoMerge tag 'kbuild-fixes-v5.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 3 Jan 2020 19:21:25 +0000 (11:21 -0800)]
Merge tag 'kbuild-fixes-v5.5-2' of git://git./linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - fix build error in usr/gen_initramfs_list.sh

 - fix libelf-dev dependency in deb-pkg build

* tag 'kbuild-fixes-v5.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild/deb-pkg: annotate libelf-dev dependency as :native
  gen_initramfs_list.sh: fix 'bad variable name' error

4 years agoMerge tag 'for-linus-2020-01-03' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 3 Jan 2020 19:17:14 +0000 (11:17 -0800)]
Merge tag 'for-linus-2020-01-03' of git://git./linux/kernel/git/brauner/linux

Pull thread fixes from Christian Brauner:
 "Here are two fixes:

   - Panic earlier when global init exits to generate useable coredumps.

     Currently, when global init and all threads in its thread-group
     have exited we panic via:

       do_exit()
       -> exit_notify()
          -> forget_original_parent()
             -> find_child_reaper()

     This makes it hard to extract a useable coredump for global init
     from a kernel crashdump because by the time we panic exit_mm() will
     have already released global init's mm. We now panic slightly
     earlier. This has been a problem in certain environments such as
     Android.

   - Fix a race in assigning and reading taskstats for thread-groups
     with more than one thread.

     This patch has been waiting for quite a while since people
     disagreed on what the correct fix was at first"

* tag 'for-linus-2020-01-03' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  exit: panic before exit_mm() on global init exit
  taskstats: fix data-race

4 years agoMerge tag 'powerpc-5.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Fri, 3 Jan 2020 19:13:50 +0000 (11:13 -0800)]
Merge tag 'powerpc-5.5-5' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Two more powerpc fixes for 5.5:

   - One commit to fix a build error when CONFIG_JUMP_LABEL=n,
     introduced by our recent fix to is_shared_processor().

   - A commit marking some SLB related functions as notrace, as tracing
     them triggers warnings.

  Thanks to Jason A Donenfeld"

* tag 'powerpc-5.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/spinlocks: Include correct header for static key
  powerpc/mm: Mark get_slice_psize() & slice_addr_is_low() as notrace

4 years agoMerge tag 'sound-5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai...
Linus Torvalds [Fri, 3 Jan 2020 19:10:31 +0000 (11:10 -0800)]
Merge tag 'sound-5.5-rc5' of git://git./linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "Nothing to worry at this stage but all nice small changes:

   - A regression fix for AMD GPU detection in HD-audio

   - A long-standing sleep-in-atomic fix for an ice1724 device

   - Usual suspects, the device-specific quirks for HD- and USB-audio"

* tag 'sound-5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/realtek - Enable the bass speaker of ASUS UX431FLC
  ALSA: ice1724: Fix sleep-in-atomic in Infrasonic Quartet support code
  ALSA: hda/realtek - Add Bass Speaker and fixed dac for bass speaker
  ALSA: hda - Apply sync-write workaround to old Intel platforms, too
  ALSA: hda/hdmi - fix atpx_present when CLASS is not VGA
  ALSA: usb-audio: fix set_format altsetting sanity check
  ALSA: hda/realtek - Add headset Mic no shutup for ALC283
  ALSA: usb-audio: set the interface format after resume on Dell WD19

4 years agoMerge tag 'drm-fixes-2020-01-03' of git://anongit.freedesktop.org/drm/drm
Linus Torvalds [Fri, 3 Jan 2020 19:08:30 +0000 (11:08 -0800)]
Merge tag 'drm-fixes-2020-01-03' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
 "New Years fixes! Mostly amdgpu with a light smattering of arm
  graphics, and two AGP warning fixes.

  Quiet as expected, hopefully we don't get a post holiday rush.

  agp:
   - two unused variable removed

  amdgpu:
   - ATPX regression fix
   - SMU metrics table locking fixes
   - gfxoff fix for raven
   - RLC firmware loading stability fix

  mediatek:
   - external display fix
   - dsi timing fix

  sun4i:
   - Fix double-free in connector/encoder cleanup (Stefan)

  maildp:
   - Make vtable static (Ben)"

* tag 'drm-fixes-2020-01-03' of git://anongit.freedesktop.org/drm/drm:
  agp: remove unused variable arqsz in agp_3_5_enable()
  agp: remove unused variable mcapndx
  drm/amdgpu: correct RLC firmwares loading sequence
  drm/amdgpu: enable gfxoff for raven1 refresh
  drm/amdgpu/smu: add metrics table lock for vega20 (v2)
  drm/amdgpu/smu: add metrics table lock for navi (v2)
  drm/amdgpu/smu: add metrics table lock for arcturus (v2)
  drm/amdgpu/smu: add metrics table lock
  Revert "drm/amdgpu: simplify ATPX detection"
  drm/arm/mali: make malidp_mw_connector_helper_funcs static
  drm/sun4i: hdmi: Remove duplicate cleanup calls
  drm/mediatek: reduce the hbp and hfp for phy timing
  drm/mediatek: Fix can't get component for external display plane.
  drm/mediatek: Check return value of mtk_drm_ddp_comp_for_plane.

4 years agomm/hugetlbfs: fix for_each_hstate() loop in init_hugetlbfs_fs()
Jan Stancek [Fri, 3 Jan 2020 17:37:18 +0000 (18:37 +0100)]
mm/hugetlbfs: fix for_each_hstate() loop in init_hugetlbfs_fs()

LTP memfd_create04 started failing for some huge page sizes
after v5.4-10135-gc3bfc5dd73c6.

The problem is the check introduced to for_each_hstate() loop that
should skip default_hstate_idx.  Since it doesn't update 'i' counter,
all subsequent huge page sizes are skipped as well.

Fixes: 8fc312b32b25 ("mm/hugetlbfs: fix error handling when setting up mounts")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agokbuild/deb-pkg: annotate libelf-dev dependency as :native
Ard Biesheuvel [Mon, 30 Dec 2019 14:07:47 +0000 (15:07 +0100)]
kbuild/deb-pkg: annotate libelf-dev dependency as :native

Cross compiling the x86 kernel on a non-x86 build machine produces
the following error when CONFIG_UNWINDER_ORC is enabled, regardless
of whether libelf-dev is installed or not.

  dpkg-checkbuilddeps: error: Unmet build dependencies: libelf-dev
  dpkg-buildpackage: warning: build dependencies/conflicts unsatisfied; aborting
  dpkg-buildpackage: warning: (Use -d flag to override.)

Since this is a build time dependency for a build tool, we need to
depend on the native version of libelf-dev so add the appropriate
annotation.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
4 years agogen_initramfs_list.sh: fix 'bad variable name' error
Masahiro Yamada [Mon, 30 Dec 2019 13:20:06 +0000 (22:20 +0900)]
gen_initramfs_list.sh: fix 'bad variable name' error

Prior to commit 858805b336be ("kbuild: add $(BASH) to run scripts with
bash-extension"), this shell script was almost always run by bash since
bash is usually installed on the system by default.

Now, this script is run by sh, which might be a symlink to dash. On such
distributions, the following code emits an error:

  local dev=`LC_ALL=C ls -l "${location}"`

You can reproduce the build error, for example by setting
CONFIG_INITRAMFS_SOURCE="/dev".

    GEN     usr/initramfs_data.cpio.gz
  ./usr/gen_initramfs_list.sh: 131: local: 1: bad variable name
  make[1]: *** [usr/Makefile:61: usr/initramfs_data.cpio.gz] Error 2

This is because `LC_ALL=C ls -l "${location}"` contains spaces.
Surrounding it with double-quotes fixes the error.

Fixes: 858805b336be ("kbuild: add $(BASH) to run scripts with bash-extension")
Reported-by: Jory A. Pratt <anarchy@gentoo.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
4 years agomedia: intel-ipu3: Align struct ipu3_uapi_awb_fr_config_s to 32 bytes
Sakari Ailus [Wed, 6 Nov 2019 11:57:07 +0000 (12:57 +0100)]
media: intel-ipu3: Align struct ipu3_uapi_awb_fr_config_s to 32 bytes

A struct that needs to be aligned to 32 bytes has a size of 28. Increase
the size to 32.

This makes elements of arrays of this struct aligned to 32 as well, and
other structs where members are aligned to 32 mixing
ipu3_uapi_awb_fr_config_s as well as other types.

Fixes: commit dca5ef2aa1e6 ("media: staging/intel-ipu3: remove the unnecessary compiler flags")
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Tested-by: Bingbu Cao <bingbu.cao@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
4 years agoriscv: ftrace: correct the condition logic in function graph tracer
Zong Li [Mon, 23 Dec 2019 08:46:13 +0000 (16:46 +0800)]
riscv: ftrace: correct the condition logic in function graph tracer

The condition should be logical NOT to assign the hook address to parent
address. Because the return value 0 of function_graph_enter upon
success.

Fixes: e949b6db51dc (riscv/function_graph: Simplify with function_graph_enter())
Signed-off-by: Zong Li <zong.li@sifive.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: stable@vger.kernel.org
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
4 years agoriscv: dts: Add DT support for SiFive L2 cache controller
Yash Shah [Fri, 3 Jan 2020 04:13:20 +0000 (09:43 +0530)]
riscv: dts: Add DT support for SiFive L2 cache controller

Add the L2 cache controller DT node in SiFive FU540 soc-specific DT file

Signed-off-by: Yash Shah <yash.shah@sifive.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
4 years agoriscv: gcov: enable gcov for RISC-V
Zong Li [Thu, 2 Jan 2020 03:09:54 +0000 (11:09 +0800)]
riscv: gcov: enable gcov for RISC-V

This patch enables GCOV code coverage measurement on RISC-V.
Lightly tested on QEMU and Hifive Unleashed board, seems to work as
expected.

Signed-off-by: Zong Li <zong.li@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
4 years agoriscv: mm: use __pa_symbol for kernel symbols
Zong Li [Thu, 2 Jan 2020 03:12:40 +0000 (11:12 +0800)]
riscv: mm: use __pa_symbol for kernel symbols

__pa_symbol is the marcro that should be used for kernel symbols. It is
also a pre-requisite for DEBUG_VIRTUAL which will do bounds checking.

Signed-off-by: Zong Li <zong.li@sifive.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
4 years agoagp: remove unused variable arqsz in agp_3_5_enable()
Yunfeng Ye [Tue, 17 Dec 2019 12:22:57 +0000 (20:22 +0800)]
agp: remove unused variable arqsz in agp_3_5_enable()

This patch fix the following warning:
drivers/char/agp/isoch.c: In function ‘agp_3_5_enable’:
drivers/char/agp/isoch.c:322:13: warning: variable ‘arqsz’ set but not
used [-Wunused-but-set-variable]
  u32 isoch, arqsz;
             ^~~~~

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
4 years agoagp: remove unused variable mcapndx
Yunfeng Ye [Tue, 17 Dec 2019 12:21:37 +0000 (20:21 +0800)]
agp: remove unused variable mcapndx

This patch fix the following warning:
drivers/char/agp/isoch.c: In function ‘agp_3_5_isochronous_node_enable’:
drivers/char/agp/isoch.c:87:5: warning: variable ‘mcapndx’ set but not
used [-Wunused-but-set-variable]
  u8 mcapndx;
     ^~~~~~~

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
4 years agoMerge tag 'sizeof_field-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 3 Jan 2020 01:04:43 +0000 (17:04 -0800)]
Merge tag 'sizeof_field-v5.5-rc5' of git://git./linux/kernel/git/kees/linux

Pull final sizeof_field conversion from Kees Cook:
 "Remove now unused FIELD_SIZEOF() macro (Kees Cook)"

* tag 'sizeof_field-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  kernel.h: Remove unused FIELD_SIZEOF()

4 years agoMerge tag 'gcc-plugins-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 3 Jan 2020 00:46:30 +0000 (16:46 -0800)]
Merge tag 'gcc-plugins-v5.5-rc5' of git://git./linux/kernel/git/kees/linux

Pull gcc-plugins fix from Kees Cook:
 "Build flexibility fix: allow builds to disable plugins even when
  plugins available (Arnd Bergmann)"

* tag 'gcc-plugins-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  gcc-plugins: make it possible to disable CONFIG_GCC_PLUGINS again

4 years agoMerge tag 'seccomp-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees...
Linus Torvalds [Fri, 3 Jan 2020 00:42:10 +0000 (16:42 -0800)]
Merge tag 'seccomp-v5.5-rc5' of git://git./linux/kernel/git/kees/linux

Pull seccomp fixes from Kees Cook:
 "Fixes for seccomp_notify_ioctl uapi sanity from Sargun Dhillon.

  The bulk of this is fixing the surrounding samples and selftests so
  that seccomp can correctly validate the seccomp_notify_ioctl buffer as
  being initially zeroed.

  Summary:

   - Fix samples and selftests to zero passed-in buffer

   - Enforce zeroed buffer checking

   - Verify buffer sanity check in selftest"

* tag 'seccomp-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests/seccomp: Catch garbage on SECCOMP_IOCTL_NOTIF_RECV
  seccomp: Check that seccomp_notif is zeroed out by the user
  selftests/seccomp: Zero out seccomp_notif
  samples/seccomp: Zero out members based on seccomp_notif_sizes

4 years agoMIPS: Avoid VDSO ABI breakage due to global register variable
Paul Burton [Thu, 2 Jan 2020 04:50:38 +0000 (20:50 -0800)]
MIPS: Avoid VDSO ABI breakage due to global register variable

Declaring __current_thread_info as a global register variable has the
effect of preventing GCC from saving & restoring its value in cases
where the ABI would typically do so.

To quote GCC documentation:

> If the register is a call-saved register, call ABI is affected: the
> register will not be restored in function epilogue sequences after the
> variable has been assigned. Therefore, functions cannot safely return
> to callers that assume standard ABI.

When our position independent VDSO is built for the n32 or n64 ABIs all
functions it exposes should be preserving the value of $gp/$28 for their
caller, but in the presence of the __current_thread_info global register
variable GCC stops doing so & simply clobbers $gp/$28 when calculating
the address of the GOT.

In cases where the VDSO returns success this problem will typically be
masked by the caller in libc returning & restoring $gp/$28 itself, but
that is by no means guaranteed. In cases where the VDSO returns an error
libc will typically contain a fallback path which will now fail
(typically with a bad memory access) if it attempts anything which
relies upon the value of $gp/$28 - eg. accessing anything via the GOT.

One fix for this would be to move the declaration of
__current_thread_info inside the current_thread_info() function,
demoting it from global register variable to local register variable &
avoiding inadvertently creating a non-standard calling ABI for the VDSO.
Unfortunately this causes issues for clang, which doesn't support local
register variables as pointed out by commit fe92da0f355e ("MIPS: Changed
current_thread_info() to an equivalent supported by both clang and GCC")
which introduced the global register variable before we had a VDSO to
worry about.

Instead, fix this by continuing to use the global register variable for
the kernel proper but declare __current_thread_info as a simple extern
variable when building the VDSO. It should never be referenced, and will
cause a link error if it is. This resolves the calling convention issue
for the VDSO without having any impact upon the build of the kernel
itself for either clang or gcc.

Signed-off-by: Paul Burton <paulburton@kernel.org>
Fixes: ebb5e78cc634 ("MIPS: Initial implementation of a VDSO")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <christian.brauner@canonical.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: <stable@vger.kernel.org> # v4.4+
Cc: linux-mips@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
4 years agoMerge tag 'pstore-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees...
Linus Torvalds [Fri, 3 Jan 2020 00:39:51 +0000 (16:39 -0800)]
Merge tag 'pstore-v5.5-rc5' of git://git./linux/kernel/git/kees/linux

Pull pstore bug fixes from Kees Cook:

 - always reset circular buffer state when writing new dump (Aleksandr
   Yashkin)

 - fix rare error-path memory leak (Kees Cook)

* tag 'pstore-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/ram: Write new dumps to start of recycled zones
  pstore/ram: Fix error-path memory leak in persistent_ram_new() callers

4 years agoRevert "fs: remove ksys_dup()"
Dominik Brodowski [Wed, 1 Jan 2020 19:05:03 +0000 (20:05 +0100)]
Revert "fs: remove ksys_dup()"

This reverts commit 8243186f0cc7 ("fs: remove ksys_dup()") and the
subsequent fix for it in commit 2d3145f8d280 ("early init: fix error
handling when opening /dev/console").

Trying to use filp_open() and f_dupfd() instead of pseudo-syscalls
caused more trouble than what is worth it: it requires accessing vfs
internals and it turns out there were other bugs in it too.

In particular, the file reference counting was wrong - because unlike
the original "open+2*dup" sequence it used "filp_open+3*f_dupfd" and
thus had an extra leaked file reference.

That in turn then caused odd problems with Androidx86 long after boot
becaue of how the extra reference to the console kept the session active
even after all file descriptors had been closed.

Reported-by: youling 257 <youling257@gmail.com>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agogcc-plugins: make it possible to disable CONFIG_GCC_PLUGINS again
Arnd Bergmann [Wed, 11 Dec 2019 13:39:28 +0000 (14:39 +0100)]
gcc-plugins: make it possible to disable CONFIG_GCC_PLUGINS again

I noticed that randconfig builds with gcc no longer produce a lot of
ccache hits, unlike with clang, and traced this back to plugins
now being enabled unconditionally if they are supported.

I am now working around this by adding

   export CCACHE_COMPILERCHECK=/usr/bin/size -A %compiler%

to my top-level Makefile. This changes the heuristic that ccache uses
to determine whether the plugins are the same after a 'make clean'.

However, it also seems that being able to just turn off the plugins is
generally useful, at least for build testing it adds noticeable overhead
but does not find a lot of bugs additional bugs, and may be easier for
ccache users than my workaround.

Fixes: 9f671e58159a ("security: Create "kernel hardening" config area")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20191211133951.401933-1-arnd@arndb.de
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
4 years agoselftests/seccomp: Catch garbage on SECCOMP_IOCTL_NOTIF_RECV
Sargun Dhillon [Mon, 30 Dec 2019 20:38:11 +0000 (12:38 -0800)]
selftests/seccomp: Catch garbage on SECCOMP_IOCTL_NOTIF_RECV

This adds logic to the user_notification_basic test to set a member
of struct seccomp_notif to an invalid value to ensure that the kernel
returns EINVAL if any of the struct seccomp_notif members are set to
invalid values.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20191230203811.4996-1-sargun@sargun.me
Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
4 years agoseccomp: Check that seccomp_notif is zeroed out by the user
Sargun Dhillon [Sun, 29 Dec 2019 06:24:50 +0000 (22:24 -0800)]
seccomp: Check that seccomp_notif is zeroed out by the user

This patch is a small change in enforcement of the uapi for
SECCOMP_IOCTL_NOTIF_RECV ioctl. Specifically, the datastructure which
is passed (seccomp_notif) must be zeroed out. Previously any of its
members could be set to nonsense values, and we would ignore it.

This ensures all fields are set to their zero value.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Tycho Andersen <tycho@tycho.ws>
Link: https://lore.kernel.org/r/20191229062451.9467-2-sargun@sargun.me
Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
4 years agoselftests/seccomp: Zero out seccomp_notif
Sargun Dhillon [Sun, 29 Dec 2019 06:24:49 +0000 (22:24 -0800)]
selftests/seccomp: Zero out seccomp_notif

The seccomp_notif structure should be zeroed out prior to calling the
SECCOMP_IOCTL_NOTIF_RECV ioctl. Previously, the kernel did not check
whether these structures were zeroed out or not, so these worked.

This patch zeroes out the seccomp_notif data structure prior to calling
the ioctl.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20191229062451.9467-1-sargun@sargun.me
Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
4 years agosamples/seccomp: Zero out members based on seccomp_notif_sizes
Sargun Dhillon [Mon, 30 Dec 2019 20:35:03 +0000 (12:35 -0800)]
samples/seccomp: Zero out members based on seccomp_notif_sizes

The sizes by which seccomp_notif and seccomp_notif_resp are allocated are
based on the SECCOMP_GET_NOTIF_SIZES ioctl. This allows for graceful
extension of these datastructures. If userspace zeroes out the
datastructure based on its version, and it is lagging behind the kernel's
version, it will end up sending trailing garbage. On the other hand,
if it is ahead of the kernel version, it will write extra zero space,
and potentially cause corruption.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Suggested-by: Tycho Andersen <tycho@tycho.ws>
Link: https://lore.kernel.org/r/20191230203503.4925-1-sargun@sargun.me
Fixes: fec7b6690541 ("samples: add an example of seccomp user trap")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
4 years agopstore/ram: Write new dumps to start of recycled zones
Aleksandr Yashkin [Mon, 23 Dec 2019 13:38:16 +0000 (18:38 +0500)]
pstore/ram: Write new dumps to start of recycled zones

The ram_core.c routines treat przs as circular buffers. When writing a
new crash dump, the old buffer needs to be cleared so that the new dump
doesn't end up in the wrong place (i.e. at the end).

The solution to this problem is to reset the circular buffer state before
writing a new Oops dump.

Signed-off-by: Aleksandr Yashkin <a.yashkin@inango-systems.com>
Signed-off-by: Nikolay Merinov <n.merinov@inango-systems.com>
Signed-off-by: Ariel Gilman <a.gilman@inango-systems.com>
Link: https://lore.kernel.org/r/20191223133816.28155-1-n.merinov@inango-systems.com
Fixes: 896fc1f0c4c6 ("pstore/ram: Switch to persistent_ram routines")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
4 years agopstore/ram: Fix error-path memory leak in persistent_ram_new() callers
Kees Cook [Mon, 30 Dec 2019 19:48:10 +0000 (11:48 -0800)]
pstore/ram: Fix error-path memory leak in persistent_ram_new() callers

For callers that allocated a label for persistent_ram_new(), if the call
fails, they must clean up the allocation.

Suggested-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Fixes: 1227daa43bce ("pstore/ram: Clarify resource reservation labels")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/20191211191353.14385-1-navid.emamdoost@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
4 years agoapparmor: only get a label reference if the fast path check fails
John Johansen [Wed, 18 Dec 2019 19:04:07 +0000 (11:04 -0800)]
apparmor: only get a label reference if the fast path check fails

The common fast path check can be done under rcu_read_lock() and
doesn't need a reference count on the label. Only take a reference
count if entering the slow path.

Fixes reported hackbench regression
  - sha1 79e178a57dae ("Merge tag 'apparmor-pr-2019-12-03' of
    git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor")

  hackbench -l (256000/#grp) -g #grp
   128 groups     19.679 ±0.90%

  - previous sha1 01d1dff64662 ("Merge tag 's390-5.5-2' of
    git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux")

  hackbench -l (256000/#grp) -g #grp
   128 groups     3.1689 ±3.04%

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: bce4e7e9c45e ("apparmor: reduce rcu_read_lock scope for aa_file_perm mediation")
Signed-off-by: John Johansen <john.johansen@canonical.com>
4 years agoapparmor: fix bind mounts aborting with -ENOMEM
Patrick Steinhardt [Wed, 11 Dec 2019 11:44:08 +0000 (12:44 +0100)]
apparmor: fix bind mounts aborting with -ENOMEM

With commit df323337e507 ("apparmor: Use a memory pool instead per-CPU
caches, 2019-05-03"), AppArmor code was converted to use memory pools. In
that conversion, a bug snuck into the code that polices bind mounts that
causes all bind mounts to fail with -ENOMEM, as we erroneously error out
if `aa_get_buffer` returns a pointer instead of erroring out when it
does _not_ return a valid pointer.

Fix the issue by correctly checking for valid pointers returned by
`aa_get_buffer` to fix bind mounts with AppArmor.

Fixes: df323337e507 ("apparmor: Use a memory pool instead per-CPU caches")
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: John Johansen <john.johansen@canonical.com>
4 years agoMerge tag 'amd-drm-fixes-5.5-2020-01-01' of git://people.freedesktop.org/~agd5f/linux...
Dave Airlie [Thu, 2 Jan 2020 00:16:04 +0000 (10:16 +1000)]
Merge tag 'amd-drm-fixes-5.5-2020-01-01' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

amd-drm-fixes-5.5-2020-01-01:

amdgpu:
- ATPX regression fix
- SMU metrics table locking fixes
- gfxoff fix for raven
- RLC firmware loading stability fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexdeucher@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200101151307.5242-1-alexander.deucher@amd.com
4 years agoMerge tag 'drm-misc-fixes-2019-12-31' of git://anongit.freedesktop.org/drm/drm-misc...
Dave Airlie [Wed, 1 Jan 2020 23:40:54 +0000 (09:40 +1000)]
Merge tag 'drm-misc-fixes-2019-12-31' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

-sun4i: Fix double-free in connector/encoder cleanup (Stefan)
-malidp: Make vtable static (Ben)

Cc: Ben Dooks <ben.dooks@codethink.co.uk>
Cc: Stefan Mavrodiev <stefan@olimex.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Sean Paul <sean@poorly.run>
Link: https://patchwork.freedesktop.org/patch/msgid/20191231152503.GA46740@art_vandelay
4 years agoMerge tag 'mediatek-drm-fixes-5.5' of https://github.com/ckhu-mediatek/linux.git...
Dave Airlie [Wed, 1 Jan 2020 23:40:03 +0000 (09:40 +1000)]
Merge tag 'mediatek-drm-fixes-5.5' of https://github.com/ckhu-mediatek/linux.git-tags into drm-fixes

Mediatek DRM fixes for Linux 5.5

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: CK Hu <ck.hu@mediatek.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1577762298.23194.2.camel@mtksdaap41
4 years agodrm/amdgpu: correct RLC firmwares loading sequence
Evan Quan [Mon, 23 Dec 2019 08:13:48 +0000 (16:13 +0800)]
drm/amdgpu: correct RLC firmwares loading sequence

Per confirmation with RLC firmware team, the RLC should
be unhalted after all RLC related firmwares uploaded.
However, in fact the RLC is unhalted immediately after
RLCG firmware uploaded. And that may causes unexpected
PSP hang on loading the succeeding RLC save restore
list related firmwares.
So, we correct the firmware loading sequence to load
RLC save restore list related firmwares before RLCG
ucode. That will help to get around this issue.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
4 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Tue, 31 Dec 2019 19:14:58 +0000 (11:14 -0800)]
Merge git://git./linux/kernel/git/netdev/net

Pull networking fixes from David Miller:

 1) Fix big endian overflow in nf_flow_table, from Arnd Bergmann.

 2) Fix port selection on big endian in nft_tproxy, from Phil Sutter.

 3) Fix precision tracking for unbound scalars in bpf verifier, from
    Daniel Borkmann.

 4) Fix integer overflow in socket rcvbuf check in UDP, from Antonio
    Messina.

 5) Do not perform a neigh confirmation during a pmtu update over a
    tunnel, from Hangbin Liu.

 6) Fix DMA mapping leak in dpaa_eth driver, from Madalin Bucur.

 7) Various PTP fixes for sja1105 dsa driver, from Vladimir Oltean.

 8) Add missing to dummy definition of of_mdiobus_child_is_phy(), from
    Geert Uytterhoeven

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
  hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()
  net/sched: add delete_empty() to filters and use it in cls_flower
  tcp: Fix highest_sack and highest_sack_seq
  ptp: fix the race between the release of ptp_clock and cdev
  net: dsa: sja1105: Reconcile the meaning of TPID and TPID2 for E/T and P/Q/R/S
  Documentation: net: dsa: sja1105: Remove text about taprio base-time limitation
  net: dsa: sja1105: Remove restriction of zero base-time for taprio offload
  net: dsa: sja1105: Really make the PTP command read-write
  net: dsa: sja1105: Take PTP egress timestamp by port, not mgmt slot
  cxgb4/cxgb4vf: fix flow control display for auto negotiation
  mlxsw: spectrum: Use dedicated policer for VRRP packets
  mlxsw: spectrum_router: Skip loopback RIFs during MAC validation
  net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs
  net/sched: act_mirred: Pull mac prior redir to non mac_header_xmit device
  net_sched: sch_fq: properly set sk->sk_pacing_status
  bnx2x: Fix accounting of vlan resources among the PFs
  bnx2x: Use appropriate define for vlan credit
  of: mdio: Add missing inline to of_mdiobus_child_is_phy() dummy
  net: phy: aquantia: add suspend / resume ops for AQR105
  dpaa_eth: fix DMA mapping leak
  ...

4 years agoMerge tag 'tomoyo-fixes-for-5.5' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1
Linus Torvalds [Tue, 31 Dec 2019 18:51:27 +0000 (10:51 -0800)]
Merge tag 'tomoyo-fixes-for-5.5' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1

Pull tomoyo fixes from Tetsuo Handa:
 "Two bug fixes:

   - Suppress RCU warning at list_for_each_entry_rcu()

   - Don't use fancy names on sockets"

* tag 'tomoyo-fixes-for-5.5' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1:
  tomoyo: Suppress RCU warning at list_for_each_entry_rcu().
  tomoyo: Don't use nifty names on sockets.

4 years agohsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()
Taehee Yoo [Sat, 28 Dec 2019 16:28:09 +0000 (16:28 +0000)]
hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()

hsr slave interfaces don't have debugfs directory.
So, hsr_debugfs_rename() shouldn't be called when hsr slave interface name
is changed.

Test commands:
    ip link add dummy0 type dummy
    ip link add dummy1 type dummy
    ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
    ip link set dummy0 name ap

Splat looks like:
[21071.899367][T22666] ap: renamed from dummy0
[21071.914005][T22666] ==================================================================
[21071.919008][T22666] BUG: KASAN: slab-out-of-bounds in hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.923640][T22666] Read of size 8 at addr ffff88805febcd98 by task ip/22666
[21071.926941][T22666]
[21071.927750][T22666] CPU: 0 PID: 22666 Comm: ip Not tainted 5.5.0-rc2+ #240
[21071.929919][T22666] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[21071.935094][T22666] Call Trace:
[21071.935867][T22666]  dump_stack+0x96/0xdb
[21071.936687][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.937774][T22666]  print_address_description.constprop.5+0x1be/0x360
[21071.939019][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940081][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940949][T22666]  __kasan_report+0x12a/0x16f
[21071.941758][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.942674][T22666]  kasan_report+0xe/0x20
[21071.943325][T22666]  hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.944187][T22666]  hsr_netdev_notify+0x1fe/0x9b0 [hsr]
[21071.945052][T22666]  ? __module_text_address+0x13/0x140
[21071.945897][T22666]  notifier_call_chain+0x90/0x160
[21071.946743][T22666]  dev_change_name+0x419/0x840
[21071.947496][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.948600][T22666]  ? netdev_adjacent_rename_links+0x280/0x280
[21071.949577][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.950672][T22666]  ? lock_downgrade+0x6e0/0x6e0
[21071.951345][T22666]  ? do_setlink+0x811/0x2ef0
[21071.951991][T22666]  do_setlink+0x811/0x2ef0
[21071.952613][T22666]  ? is_bpf_text_address+0x81/0xe0
[ ... ]

Reported-by: syzbot+9328206518f08318a5fd@syzkaller.appspotmail.com
Fixes: 4c2d5e33dcd3 ("hsr: rename debugfs file when interface name is changed")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet/sched: add delete_empty() to filters and use it in cls_flower
Davide Caratti [Sat, 28 Dec 2019 15:36:58 +0000 (16:36 +0100)]
net/sched: add delete_empty() to filters and use it in cls_flower

Revert "net/sched: cls_u32: fix refcount leak in the error path of
u32_change()", and fix the u32 refcount leak in a more generic way that
preserves the semantic of rule dumping.
On tc filters that don't support lockless insertion/removal, there is no
need to guard against concurrent insertion when a removal is in progress.
Therefore, for most of them we can avoid a full walk() when deleting, and
just decrease the refcount, like it was done on older Linux kernels.
This fixes situations where walk() was wrongly detecting a non-empty
filter, like it happened with cls_u32 in the error path of change(), thus
leading to failures in the following tdc selftests:

 6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
 6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
 74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id

On cls_flower, and on (future) lockless filters, this check is necessary:
move all the check_empty() logic in a callback so that each filter
can have its own implementation. For cls_flower, it's sufficient to check
if no IDRs have been allocated.

This reverts commit 275c44aa194b7159d1191817b20e076f55f0e620.

Changes since v1:
 - document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
   is used, thanks to Vlad Buslov
 - implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
 - squash revert and new fix in a single patch, to be nice with bisect
   tests that run tdc on u32 filter, thanks to Dave Miller

Fixes: 275c44aa194b ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
Fixes: 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty")
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agotcp: Fix highest_sack and highest_sack_seq
Cambda Zhu [Fri, 27 Dec 2019 08:52:37 +0000 (16:52 +0800)]
tcp: Fix highest_sack and highest_sack_seq

>From commit 50895b9de1d3 ("tcp: highest_sack fix"), the logic about
setting tp->highest_sack to the head of the send queue was removed.
Of course the logic is error prone, but it is logical. Before we
remove the pointer to the highest sack skb and use the seq instead,
we need to set tp->highest_sack to NULL when there is no skb after
the last sack, and then replace NULL with the real skb when new skb
inserted into the rtx queue, because the NULL means the highest sack
seq is tp->snd_nxt. If tp->highest_sack is NULL and new data sent,
the next ACK with sack option will increase tp->reordering unexpectedly.

This patch sets tp->highest_sack to the tail of the rtx queue if
it's NULL and new data is sent. The patch keeps the rule that the
highest_sack can only be maintained by sack processing, except for
this only case.

Fixes: 50895b9de1d3 ("tcp: highest_sack fix")
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoptp: fix the race between the release of ptp_clock and cdev
Vladis Dronov [Fri, 27 Dec 2019 02:26:27 +0000 (03:26 +0100)]
ptp: fix the race between the release of ptp_clock and cdev

In a case when a ptp chardev (like /dev/ptp0) is open but an underlying
device is removed, closing this file leads to a race. This reproduces
easily in a kvm virtual machine:

ts# cat openptp0.c
int main() { ... fp = fopen("/dev/ptp0", "r"); ... sleep(10); }
ts# uname -r
5.5.0-rc3-46cf053e
ts# cat /proc/cmdline
... slub_debug=FZP
ts# modprobe ptp_kvm
ts# ./openptp0 &
[1] 670
opened /dev/ptp0, sleeping 10s...
ts# rmmod ptp_kvm
ts# ls /dev/ptp*
ls: cannot access '/dev/ptp*': No such file or directory
ts# ...woken up
[   48.010809] general protection fault: 0000 [#1] SMP
[   48.012502] CPU: 6 PID: 658 Comm: openptp0 Not tainted 5.5.0-rc3-46cf053e #25
[   48.014624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
[   48.016270] RIP: 0010:module_put.part.0+0x7/0x80
[   48.017939] RSP: 0018:ffffb3850073be00 EFLAGS: 00010202
[   48.018339] RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffff89a476c00ad0
[   48.018936] RDX: fffff65a08d3ea08 RSI: 0000000000000247 RDI: 6b6b6b6b6b6b6b6b
[   48.019470] ...                                              ^^^ a slub poison
[   48.023854] Call Trace:
[   48.024050]  __fput+0x21f/0x240
[   48.024288]  task_work_run+0x79/0x90
[   48.024555]  do_exit+0x2af/0xab0
[   48.024799]  ? vfs_write+0x16a/0x190
[   48.025082]  do_group_exit+0x35/0x90
[   48.025387]  __x64_sys_exit_group+0xf/0x10
[   48.025737]  do_syscall_64+0x3d/0x130
[   48.026056]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   48.026479] RIP: 0033:0x7f53b12082f6
[   48.026792] ...
[   48.030945] Modules linked in: ptp i6300esb watchdog [last unloaded: ptp_kvm]
[   48.045001] Fixing recursive fault but reboot is needed!

This happens in:

static void __fput(struct file *file)
{   ...
    if (file->f_op->release)
        file->f_op->release(inode, file); <<< cdev is kfree'd here
    if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
             !(mode & FMODE_PATH))) {
        cdev_put(inode->i_cdev); <<< cdev fields are accessed here

Namely:

__fput()
  posix_clock_release()
    kref_put(&clk->kref, delete_clock) <<< the last reference
      delete_clock()
        delete_ptp_clock()
          kfree(ptp) <<< cdev is embedded in ptp
  cdev_put
    module_put(p->owner) <<< *p is kfree'd, bang!

Here cdev is embedded in posix_clock which is embedded in ptp_clock.
The race happens because ptp_clock's lifetime is controlled by two
refcounts: kref and cdev.kobj in posix_clock. This is wrong.

Make ptp_clock's sysfs device a parent of cdev with cdev_device_add()
created especially for such cases. This way the parent device with its
ptp_clock is not released until all references to the cdev are released.
This adds a requirement that an initialized but not exposed struct
device should be provided to posix_clock_register() by a caller instead
of a simple dev_t.

This approach was adopted from the commit 72139dfa2464 ("watchdog: Fix
the race between the release of watchdog_core_data and cdev"). See
details of the implementation in the commit 233ed09d7fda ("chardev: add
helper function to register char devs with a struct device").

Link: https://lore.kernel.org/linux-fsdevel/20191125125342.6189-1-vdronov@redhat.com/T/#u
Analyzed-by: Stephen Johnston <sjohnsto@redhat.com>
Analyzed-by: Vern Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>