openwrt/staging/blogic.git
5 years agonet/mlx5e: Additional check for flow destination comparison
Dmytro Linkin [Thu, 2 May 2019 15:21:38 +0000 (15:21 +0000)]
net/mlx5e: Additional check for flow destination comparison

Flow destination comparison has an inaccuracy: code see no
difference between same vf ports, which belong to different pfs.

Example: If start ping from VF0 (PF1) to VF1 (PF1) and mirror
all traffic to VF0 (PF2), icmp reply to VF0 (PF1) and mirrored
flow to VF0 (PF2) would be determined as same destination. It lead
to creating flow handler with rule nodes, which not added to node
tree. When later driver try to delete this flow rules we got
kernel crash.

Add comparison of vhca_id field to avoid this.

Fixes: 1228e912c934 ("net/mlx5: Consider encapsulation properties when comparing destinations")
Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Add missing ethtool driver info for representors
Dmytro Linkin [Thu, 25 Apr 2019 08:52:02 +0000 (08:52 +0000)]
net/mlx5e: Add missing ethtool driver info for representors

For all representors added firmware version info to show in
ethtool driver info.
For uplink representor, because only it is tied to the pci device
sysfs, added pci bus info.

Fixes: ff9b85de5d5d ("net/mlx5e: Add some ethtool port control entries to the uplink rep netdev")
Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com>
Reviewed-by: Gavi Teitz <gavi@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Fix number of vports for ingress ACL configuration
Eli Britstein [Mon, 13 May 2019 09:06:02 +0000 (09:06 +0000)]
net/mlx5e: Fix number of vports for ingress ACL configuration

With the cited commit, ACLs are configured for the VF ports. The loop
for the number of ports had the wrong number. Fix it.

Fixes: 184867373d8c ("net/mlx5e: ACLs for priority tag mode")
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Fix ethtool rxfh commands when CONFIG_MLX5_EN_RXNFC is disabled
Saeed Mahameed [Tue, 7 May 2019 19:59:38 +0000 (12:59 -0700)]
net/mlx5e: Fix ethtool rxfh commands when CONFIG_MLX5_EN_RXNFC is disabled

ethtool user spaces needs to know ring count via ETHTOOL_GRXRINGS when
executing (ethtool -x) which is retrieved via ethtool get_rxnfc callback,
in mlx5 this callback is disabled when CONFIG_MLX5_EN_RXNFC=n.

This patch allows only ETHTOOL_GRXRINGS command on mlx5e_get_rxnfc() when
CONFIG_MLX5_EN_RXNFC is disabled, so ethtool -x will continue working.

Fixes: fe6d86b3c316 ("net/mlx5e: Add CONFIG_MLX5_EN_RXNFC for ethtool rx nfc")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Fix wrong xmit_more application
Tariq Toukan [Wed, 15 May 2019 12:57:13 +0000 (15:57 +0300)]
net/mlx5e: Fix wrong xmit_more application

Cited patch refactored the xmit_more indication while not preserving
its functionality. Fix it.

Fixes: 3c31ff22b25f ("drivers: mellanox: use netdev_xmit_more() helper")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Fix peer pf disable hca command
Bodong Wang [Mon, 29 Apr 2019 14:56:18 +0000 (09:56 -0500)]
net/mlx5: Fix peer pf disable hca command

The command was mistakenly using enable_hca in embedded CPU field.

Fixes: 22e939a91dcb (net/mlx5: Update enable HCA dependency)
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reported-by: Alex Rosenbaum <alexr@mellanox.com>
Signed-off-by: Alex Rosenbaum <alexr@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index
Parav Pandit [Fri, 5 Apr 2019 06:07:19 +0000 (01:07 -0500)]
net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index

To avoid any ambiguity between vport index and vport number,
rename functions that had vport, to vport_num or vport_index appropriately.

vport_num is u16 hence change mlx5_eswitch_index_to_vport_num() return
type to u16.

vport_index is an int in vport array. Hence change input type of vport
index in mlx5_eswitch_index_to_vport_num() to int.

Correct multiple eswitch representor interfaces use type u16 of
rep->vport as type int vport_index.

Send vport FW commands with correct eswitch u16 vport_num instead
host int vport_index.

Fixes: 5ae5162066d8 ("net/mlx5: E-Switch, Assign a different position for uplink rep and vport")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Reviewed-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Add meaningful return codes to status_to_err function
Valentine Fatiev [Wed, 1 May 2019 08:46:05 +0000 (11:46 +0300)]
net/mlx5: Add meaningful return codes to status_to_err function

Current version of function status_to_err return -1 for any
status returned by mlx5_cmd_invoke function. In case status is
MLX5_DRIVER_STATUS_ABORTED we should return 0 to the caller as we
assume command completed successfully on FW. If error returned we are
getting confusing messages in dmesg. In addition, currently returned
value -1 is confusing with -EPERM.

New implementation actually fix original commit and return meaningful
codes for commands delivery status and print message in case of failure.

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Valentine Fatiev <valentinef@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Imply MLXFW in mlx5_core
Saeed Mahameed [Tue, 7 May 2019 20:15:20 +0000 (13:15 -0700)]
net/mlx5: Imply MLXFW in mlx5_core

mlxfw can be compiled as external module while mlx5_core can be
builtin, in such case mlx5 will act like mlxfw is disabled.

Since mlxfw is just a service library for mlx* drivers,
imply it in mlx5_core to make it always reachable if it was enabled.

Fixes: 3ffaabecd1a1 ("net/mlx5e: Support the flash device ethtool callback")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agoRevert "tipc: fix modprobe tipc failed after switch order of device registration"
David S. Miller [Fri, 17 May 2019 19:15:05 +0000 (12:15 -0700)]
Revert "tipc: fix modprobe tipc failed after switch order of device registration"

This reverts commit 532b0f7ece4cb2ffd24dc723ddf55242d1188e5e.

More revisions coming up.

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovsock/virtio: free packets during the socket release
Stefano Garzarella [Fri, 17 May 2019 14:45:43 +0000 (16:45 +0200)]
vsock/virtio: free packets during the socket release

When the socket is released, we should free all packets
queued in the per-socket list in order to avoid a memory
leak.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotipc: fix modprobe tipc failed after switch order of device registration
Junwei Hu [Fri, 17 May 2019 11:27:34 +0000 (19:27 +0800)]
tipc: fix modprobe tipc failed after switch order of device registration

Error message printed:
modprobe: ERROR: could not insert 'tipc': Address family not
supported by protocol.
when modprobe tipc after the following patch: switch order of
device registration, commit 7e27e8d6130c
("tipc: switch order of device registration to fix a crash")

Because sock_create_kern(net, AF_TIPC, ...) is called by
tipc_topsrv_create_listener() in the initialization process
of tipc_net_ops, tipc_socket_init() must be execute before that.

I move tipc_socket_init() into function tipc_init_net().

Fixes: 7e27e8d6130c
("tipc: switch order of device registration to fix a crash")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
Reported-by: Wang Wang <wangwang2@huawei.com>
Reviewed-by: Kang Zhou <zhoukang7@huawei.com>
Reviewed-by: Suanming Mou <mousuanming@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agolib: Correct comment of prandom_seed
Philippe Mazenauer [Fri, 17 May 2019 10:44:44 +0000 (10:44 +0000)]
lib: Correct comment of prandom_seed

Variable 'entropy' was wrongly documented as 'seed', changed comment to
reflect actual variable name.

../lib/random32.c:179: warning: Function parameter or member 'entropy' not described in 'prandom_seed'
../lib/random32.c:179: warning: Excess function parameter 'seed' description in 'prandom_seed'

Signed-off-by: Philippe Mazenauer <philippe.mazenauer@outlook.de>
Acked-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: caif: fix the value of size argument of snprintf
swkhack [Fri, 17 May 2019 07:59:22 +0000 (15:59 +0800)]
net: caif: fix the value of size argument of snprintf

Because the function snprintf write at most size bytes(including the
null byte).So the value of the argument size need not to minus one.

Signed-off-by: swkhack <swkhack@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoipv6: fix src addr routing with the exception table
Wei Wang [Thu, 16 May 2019 20:30:54 +0000 (13:30 -0700)]
ipv6: fix src addr routing with the exception table

When inserting route cache into the exception table, the key is
generated with both src_addr and dest_addr with src addr routing.
However, current logic always assumes the src_addr used to generate the
key is a /128 host address. This is not true in the following scenarios:
1. When the route is a gateway route or does not have next hop.
   (rt6_is_gw_or_nonexthop() == false)
2. When calling ip6_rt_cache_alloc(), saddr is passed in as NULL.
This means, when looking for a route cache in the exception table, we
have to do the lookup twice: first time with the passed in /128 host
address, second time with the src_addr stored in fib6_info.

This solves the pmtu discovery issue reported by Mikael Magnusson where
a route cache with a lower mtu info is created for a gateway route with
src addr. However, the lookup code is not able to find this route cache.

Fixes: 2b760fcf5cfb ("ipv6: hook up exception table to store dst cache")
Reported-by: Mikael Magnusson <mikael.kernel@lists.m7n.se>
Bisected-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoselftests: pmtu.sh: Remove quotes around commands in setup_xfrm
David Ahern [Thu, 16 May 2019 17:41:31 +0000 (10:41 -0700)]
selftests: pmtu.sh: Remove quotes around commands in setup_xfrm

The first command in setup_xfrm is failing resulting in the test getting
skipped:

+ ip netns exec ns-B ip -6 xfrm state add src fd00:1::a dst fd00:1::b spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode tunnel
+ out=RTNETLINK answers: Function not implemented
...
  xfrm6 not supported
TEST: vti6: PMTU exceptions                                         [SKIP]
  xfrm4 not supported
TEST: vti4: PMTU exceptions                                         [SKIP]
...

The setup command started failing when the run_cmd option was added.
Removing the quotes fixes the problem:
...
TEST: vti6: PMTU exceptions                                         [ OK ]
TEST: vti4: PMTU exceptions                                         [ OK ]
...

Fixes: 56490b623aa0 ("selftests: Add debugging options to pmtu.sh")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: avoid weird emergency message
Eric Dumazet [Thu, 16 May 2019 15:09:57 +0000 (08:09 -0700)]
net: avoid weird emergency message

When host is under high stress, it is very possible thread
running netdev_wait_allrefs() returns from msleep(250)
10 seconds late.

This leads to these messages in the syslog :

[...] unregister_netdevice: waiting for syz_tun to become free. Usage count = 0

If the device refcount is zero, the wait is over.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'aqc111-revert-endianess-fixes-and-cleanup-mtu-logic'
David S. Miller [Thu, 16 May 2019 21:22:14 +0000 (14:22 -0700)]
Merge branch 'aqc111-revert-endianess-fixes-and-cleanup-mtu-logic'

Igor Russkikh says:

====================
aqc111: revert endianess fixes and cleanup mtu logic

This reverts no-op commits as it was discussed:

https://lore.kernel.org/netdev/1557839644.11261.4.camel@suse.com/

First and second original patches are already dropped from stable,
No need to stable-queue the third patch as it has no functional impact,
just a logic cleanup.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoaqc111: cleanup mtu related logic
Igor Russkikh [Thu, 16 May 2019 14:52:25 +0000 (14:52 +0000)]
aqc111: cleanup mtu related logic

Original fix b8b277525e9d was done under impression that invalid data
could be written for mtu configuration higher that 16334.

But the high limit will anyway be rejected my max_mtu check in caller.
Thus, make the code cleaner and allow it doing the configuration without
checking for maximum mtu value.

Fixes: b8b277525e9d ("aqc111: fix endianness issue in aqc111_change_mtu")
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoRevert "aqc111: fix writing to the phy on BE"
Igor Russkikh [Thu, 16 May 2019 14:52:22 +0000 (14:52 +0000)]
Revert "aqc111: fix writing to the phy on BE"

This reverts commit 369b46e9fbcfa5136f2cb5f486c90e5f7fa92630.

The required temporary storage is already done inside of write32/16
helpers.

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoRevert "aqc111: fix double endianness swap on BE"
Igor Russkikh [Thu, 16 May 2019 14:52:20 +0000 (14:52 +0000)]
Revert "aqc111: fix double endianness swap on BE"

This reverts commit 2cf672709beb005f6e90cb4edbed6f2218ba953e.

The required temporary storage is already done inside of write32/16
helpers.

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoxfrm: ressurrect "Fix uninitialized memory read in _decode_session4"
Florian Westphal [Thu, 16 May 2019 09:28:16 +0000 (11:28 +0200)]
xfrm: ressurrect "Fix uninitialized memory read in _decode_session4"

This resurrects commit 8742dc86d0c7a9628
("xfrm4: Fix uninitialized memory read in _decode_session4"),
which got lost during a merge conflict resolution between ipsec-next
and net-next tree.

c53ac41e3720 ("xfrm: remove decode_session indirection from afinfo_policy")
in ipsec-next moved the (buggy) _decode_session4 from
net/ipv4/xfrm4_policy.c to net/xfrm/xfrm_policy.c.
In mean time, 8742dc86d0c7a was applied to ipsec.git and fixed the
problem in the "old" location.

When the trees got merged, the moved, old function was kept.
This applies the "lost" commit again, to the new location.

Fixes: a658a3f2ecbab ("Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotipc: switch order of device registration to fix a crash
Junwei Hu [Thu, 16 May 2019 02:51:15 +0000 (10:51 +0800)]
tipc: switch order of device registration to fix a crash

When tipc is loaded while many processes try to create a TIPC socket,
a crash occurs:
 PANIC: Unable to handle kernel paging request at virtual
 address "dfff20000000021d"
 pc : tipc_sk_create+0x374/0x1180 [tipc]
 lr : tipc_sk_create+0x374/0x1180 [tipc]
   Exception class = DABT (current EL), IL = 32 bits
 Call trace:
  tipc_sk_create+0x374/0x1180 [tipc]
  __sock_create+0x1cc/0x408
  __sys_socket+0xec/0x1f0
  __arm64_sys_socket+0x74/0xa8
 ...

This is due to race between sock_create and unfinished
register_pernet_device. tipc_sk_insert tries to do
"net_generic(net, tipc_net_id)".
but tipc_net_id is not initialized yet.

So switch the order of the two to close the race.

This can be reproduced with multiple processes doing socket(AF_TIPC, ...)
and one process doing module removal.

Fixes: a62fbccecd62 ("tipc: make subscriber server support net namespace")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
Reported-by: Wang Wang <wangwang2@huawei.com>
Reviewed-by: Xiaogang Wang <wangxiaogang3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoipv6: prevent possible fib6 leaks
Eric Dumazet [Thu, 16 May 2019 02:39:52 +0000 (19:39 -0700)]
ipv6: prevent possible fib6 leaks

At ipv6 route dismantle, fib6_drop_pcpu_from() is responsible
for finding all percpu routes and set their ->from pointer
to NULL, so that fib6_ref can reach its expected value (1).

The problem right now is that other cpus can still catch the
route being deleted, since there is no rcu grace period
between the route deletion and call to fib6_drop_pcpu_from()

This can leak the fib6 and associated resources, since no
notifier will take care of removing the last reference(s).

I decided to add another boolean (fib6_destroying) instead
of reusing/renaming exception_bucket_flushed to ease stable backports,
and properly document the memory barriers used to implement this fix.

This patch has been co-developped with Wei Wang.

Fixes: 93531c674315 ("net/ipv6: separate handling of FIB entries from dst based routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Wei Wang <weiwan@google.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin Lau <kafai@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: test nouarg before dereferencing zerocopy pointers
Willem de Bruijn [Wed, 15 May 2019 17:29:16 +0000 (13:29 -0400)]
net: test nouarg before dereferencing zerocopy pointers

Zerocopy skbs without completion notification were added for packet
sockets with PACKET_TX_RING user buffers. Those signal completion
through the TP_STATUS_USER bit in the ring. Zerocopy annotation was
added only to avoid premature notification after clone or orphan, by
triggering a copy on these paths for these packets.

The mechanism had to define a special "no-uarg" mode because packet
sockets already use skb_uarg(skb) == skb_shinfo(skb)->destructor_arg
for a different pointer.

Before deferencing skb_uarg(skb), verify that it is a real pointer.

Fixes: 5cd8d46ea1562 ("packet: copy user buffers before orphan or clone")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: usb: qmi_wwan: add Telit 0x1260 and 0x1261 compositions
Daniele Palmas [Wed, 15 May 2019 15:29:43 +0000 (17:29 +0200)]
net: usb: qmi_wwan: add Telit 0x1260 and 0x1261 compositions

Added support for Telit LE910Cx 0x1260 and 0x1261 compositions.

Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: phy: aquantia: readd XGMII support for AQR107
Madalin-cristian Bucur [Wed, 15 May 2019 15:07:44 +0000 (15:07 +0000)]
net: phy: aquantia: readd XGMII support for AQR107

XGMII interface mode no longer works on AQR107 after the recent changes,
adding back support.

Fixes: 570c8a7d5303 ("net: phy: aquantia: check for supported interface modes in config_init")
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: bpfilter: fallback to netfilter if failed to load bpfilter kernel module
Konstantin Khlebnikov [Wed, 15 May 2019 11:40:52 +0000 (14:40 +0300)]
net: bpfilter: fallback to netfilter if failed to load bpfilter kernel module

If bpfilter is not available return ENOPROTOOPT to fallback to netfilter.

Function request_module() returns both errors and userspace exit codes.
Just ignore them. Rechecking bpfilter_ops is enough.

Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agohv_sock: Add support for delayed close
Sunil Muthuswamy [Wed, 15 May 2019 00:56:05 +0000 (00:56 +0000)]
hv_sock: Add support for delayed close

Currently, hvsock does not implement any delayed or background close
logic. Whenever the hvsock socket is closed, a FIN is sent to the peer, and
the last reference to the socket is dropped, which leads to a call to
.destruct where the socket can hang indefinitely waiting for the peer to
close it's side. The can cause the user application to hang in the close()
call.

This change implements proper STREAM(TCP) closing handshake mechanism by
sending the FIN to the peer and the waiting for the peer's FIN to arrive
for a given timeout. On timeout, it will try to terminate the connection
(i.e. a RST). This is in-line with other socket providers such as virtio.

This change does not address the hang in the vmbus_hvsock_device_unregister
where it waits indefinitely for the host to rescind the channel. That
should be taken up as a separate fix.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoatm: iphase: Avoid copying pointers to user space.
Fuqian Huang [Wed, 15 May 2019 00:42:48 +0000 (08:42 +0800)]
atm: iphase: Avoid copying pointers to user space.

Remove the MEMDUMP_DEV case in ia_ioctl to avoid copy
pointers to user space.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'flow_offload-fix-CVLAN-support'
David S. Miller [Thu, 16 May 2019 19:02:42 +0000 (12:02 -0700)]
Merge branch 'flow_offload-fix-CVLAN-support'

Edward Cree says:

====================
flow_offload: fix CVLAN support

When the flow_offload infrastructure was added, CVLAN matches weren't
 plumbed through, and flow_rule_match_vlan() was incorrectly called in
 the mlx5 driver when populating CVLAN match information.  This series
 adds flow_rule_match_cvlan(), and uses it in the mlx5 code.
Both patches should also go to 5.1 stable.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/mlx5e: Fix calling wrong function to get inner vlan key and mask
Jianbo Liu [Tue, 14 May 2019 20:18:50 +0000 (21:18 +0100)]
net/mlx5e: Fix calling wrong function to get inner vlan key and mask

When flow_rule_match_XYZ() functions were first introduced,
flow_rule_match_cvlan() for inner vlan is missing.

In mlx5_core driver, to get inner vlan key and mask, flow_rule_match_vlan()
is just called, which is wrong because it obtains outer vlan information by
FLOW_DISSECTOR_KEY_VLAN.

This commit fixes this by changing to call flow_rule_match_cvlan() after
it's added.

Fixes: 8f2566225ae2 ("flow_offload: add flow_rule and flow_match structures and use them")
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoflow_offload: support CVLAN match
Edward Cree [Tue, 14 May 2019 20:18:12 +0000 (21:18 +0100)]
flow_offload: support CVLAN match

Plumb it through from the flow_dissector.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'rhashtable-Fix-sparse-warnings'
David S. Miller [Thu, 16 May 2019 16:45:20 +0000 (09:45 -0700)]
Merge branch 'rhashtable-Fix-sparse-warnings'

Herbert Xu says:

====================
rhashtable: Fix sparse warnings

This patch series fixes all the sparse warnings.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agorhashtable: Fix cmpxchg RCU warnings
Herbert Xu [Thu, 16 May 2019 07:19:48 +0000 (15:19 +0800)]
rhashtable: Fix cmpxchg RCU warnings

As cmpxchg is a non-RCU mechanism it will cause sparse warnings
when we use it for RCU.  This patch adds explicit casts to silence
those warnings.  This should probably be moved to RCU itself in
future.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agorhashtable: Remove RCU marking from rhash_lock_head
Herbert Xu [Thu, 16 May 2019 07:19:46 +0000 (15:19 +0800)]
rhashtable: Remove RCU marking from rhash_lock_head

The opaque type rhash_lock_head should not be marked with __rcu
because it can never be dereferenced.  We should apply the RCU
marking when we turn it into a pointer which can be dereferenced.

This patch does exactly that.  This fixes a number of sparse
warnings as well as getting rid of some unnecessary RCU checking.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
David S. Miller [Thu, 16 May 2019 01:28:44 +0000 (18:28 -0700)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2019-05-16

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a use after free in __dev_map_entry_free(), from Eric.

2) Several sockmap related bug fixes: a splat in strparser if
   it was never initialized, remove duplicate ingress msg list
   purging which can race, fix msg->sg.size accounting upon
   skb to msg conversion, and last but not least fix a timeout
   bug in tcp_bpf_wait_data(), from John.

3) Fix LRU map to avoid messing with eviction heuristics upon
   syscall lookup, e.g. map walks from user space side will
   then lead to eviction of just recently created entries on
   updates as it would mark all map entries, from Daniel.

4) Don't bail out when libbpf feature probing fails. Also
   various smaller fixes to flow_dissector test, from Stanislav.

5) Fix missing brackets for BTF_INT_OFFSET() in UAPI, from Gary.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agobpf, tcp: correctly handle DONT_WAIT flags and timeo == 0
John Fastabend [Tue, 14 May 2019 04:42:03 +0000 (21:42 -0700)]
bpf, tcp: correctly handle DONT_WAIT flags and timeo == 0

The tcp_bpf_wait_data() routine needs to check timeo != 0 before
calling sk_wait_event() otherwise we may see unexpected stalls
on receiver.

Arika did all the leg work here I just formatted, posted and ran
a few tests.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: Arika Chen <eaglesora@gmail.com>
Suggested-by: Arika Chen <eaglesora@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/bpf: add prog detach to flow_dissector test
Stanislav Fomichev [Tue, 14 May 2019 21:12:34 +0000 (14:12 -0700)]
selftests/bpf: add prog detach to flow_dissector test

In case we are not running in a namespace (which we don't do by default),
let's try to detach the bpf program that we use for eth_get_headlen tests.

Fixes: 0905beec9f52 ("selftests/bpf: run flow dissector tests in skb-less mode")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agoselftests/bpf: add missing \n to flow_dissector CHECK errors
Stanislav Fomichev [Tue, 14 May 2019 21:12:33 +0000 (14:12 -0700)]
selftests/bpf: add missing \n to flow_dissector CHECK errors

Otherwise, in case of an error, everything gets mushed together.

Fixes: a5cb33464e53 ("selftests/bpf: make flow dissector tests more extensible")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agolibbpf: don't fail when feature probing fails
Stanislav Fomichev [Wed, 15 May 2019 03:38:49 +0000 (20:38 -0700)]
libbpf: don't fail when feature probing fails

Otherwise libbpf is unusable from unprivileged process with
kernel.kernel.unprivileged_bpf_disabled=1.
All I get is EPERM from the probes, even if I just want to
open an ELF object and look at what progs/maps it has.

Instead of dying on probes, let's just pr_debug the error and
try to continue.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
5 years agotcp: do not recycle cloned skbs
Eric Dumazet [Wed, 15 May 2019 16:10:15 +0000 (09:10 -0700)]
tcp: do not recycle cloned skbs

It is illegal to change arbitrary fields in skb_shared_info if the
skb is cloned.

Before calling skb_zcopy_clear() we need to ensure this rule,
therefore we need to move the test from sk_stream_alloc_skb()
to sk_wmem_free_skb()

Fixes: 4f661542a402 ("tcp: fix zerocopy and notsent_lowat issues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoenetc: Add missing link state info for ethtool
Claudiu Manoil [Wed, 15 May 2019 16:08:58 +0000 (19:08 +0300)]
enetc: Add missing link state info for ethtool

Just hook get_link to standard ethtool_op_get_link,
nothing special needed at this point.

Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoenetc: Allow to disable Tx SG
Claudiu Manoil [Wed, 15 May 2019 16:08:57 +0000 (19:08 +0300)]
enetc: Allow to disable Tx SG

The fact that the Tx SG flag is fixed to 'on' is only
an oversight. Non-SG mode is also supported. Fix this
by allowing to turn SG off.

Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoenetc: Fix NULL dma address unmap for Tx BD extensions
Claudiu Manoil [Wed, 15 May 2019 16:08:56 +0000 (19:08 +0300)]
enetc: Fix NULL dma address unmap for Tx BD extensions

For the unlikely case of TxBD extensions (i.e. ptp)
the driver tries to unmap the tx_swbd corresponding
to the extension, which is bogus as it has no buffer
attached.

Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonfp: flower: add rcu locks when accessing netdev for tunnels
Pieter Jansen van Vuuren [Tue, 14 May 2019 21:28:19 +0000 (14:28 -0700)]
nfp: flower: add rcu locks when accessing netdev for tunnels

Add rcu locks when accessing netdev when processing route request
and tunnel keep alive messages received from hardware.

Fixes: 8e6a9046b66a ("nfp: flower vxlan neighbour offload")
Fixes: 856f5b135758 ("nfp: flower vxlan neighbour keep-alive")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoppp: deflate: Fix possible crash in deflate_init
YueHaibing [Tue, 14 May 2019 14:55:32 +0000 (22:55 +0800)]
ppp: deflate: Fix possible crash in deflate_init

BUG: unable to handle kernel paging request at ffffffffa018f000
PGD 3270067 P4D 3270067 PUD 3271063 PMD 2307eb067 PTE 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 4138 Comm: modprobe Not tainted 5.1.0-rc7+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:ppp_register_compressor+0x3e/0xd0 [ppp_generic]
Code: 98 4a 3f e2 48 8b 15 c1 67 00 00 41 8b 0c 24 48 81 fa 40 f0 19 a0
75 0e eb 35 48 8b 12 48 81 fa 40 f0 19 a0 74
RSP: 0018:ffffc90000d93c68 EFLAGS: 00010287
RAX: ffffffffa018f000 RBX: ffffffffa01a3000 RCX: 000000000000001a
RDX: ffff888230c750a0 RSI: 0000000000000000 RDI: ffffffffa019f000
RBP: ffffc90000d93c80 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa0194080
R13: ffff88822ee1a700 R14: 0000000000000000 R15: ffffc90000d93e78
FS:  00007f2339557540(0000) GS:ffff888237a00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa018f000 CR3: 000000022bde4000 CR4: 00000000000006f0
Call Trace:
 ? 0xffffffffa01a3000
 deflate_init+0x11/0x1000 [ppp_deflate]
 ? 0xffffffffa01a3000
 do_one_initcall+0x6c/0x3cc
 ? kmem_cache_alloc_trace+0x248/0x3b0
 do_init_module+0x5b/0x1f1
 load_module+0x1db1/0x2690
 ? m_show+0x1d0/0x1d0
 __do_sys_finit_module+0xc5/0xd0
 __x64_sys_finit_module+0x15/0x20
 do_syscall_64+0x6b/0x1d0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

If ppp_deflate fails to register in deflate_init,
module initialization failed out, however
ppp_deflate_draft may has been regiestred and not
unregistered before return.
Then the seconed modprobe will trigger crash like this.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: macb: fix error format in dev_err()
Luca Ceresoli [Tue, 14 May 2019 13:23:07 +0000 (15:23 +0200)]
net: macb: fix error format in dev_err()

Errors are negative numbers. Using %u shows them as very large positive
numbers such as 4294967277 that don't make sense. Use the %d format
instead, and get a much nicer -19.

Signed-off-by: Luca Ceresoli <luca@lucaceresoli.net>
Fixes: b48e0bab142f ("net: macb: Migrate to devm clock interface")
Fixes: 93b31f48b3ba ("net/macb: unify clock management")
Fixes: 421d9df0628b ("net/macb: merge at91_ether driver into macb driver")
Fixes: aead88bd0e99 ("net: ethernet: macb: Add support for rx_clk")
Fixes: f5473d1d44e4 ("net: macb: Support clock management for tsu_clk")
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agortnetlink: always put IFLA_LINK for links with a link-netnsid
Sabrina Dubroca [Tue, 14 May 2019 13:12:19 +0000 (15:12 +0200)]
rtnetlink: always put IFLA_LINK for links with a link-netnsid

Currently, nla_put_iflink() doesn't put the IFLA_LINK attribute when
iflink == ifindex.

In some cases, a device can be created in a different netns with the
same ifindex as its parent. That device will not dump its IFLA_LINK
attribute, which can confuse some userspace software that expects it.
For example, if the last ifindex created in init_net and foo are both
8, these commands will trigger the issue:

    ip link add parent type dummy                   # ifindex 9
    ip link add link parent netns foo type macvlan  # ifindex 9 in ns foo

So, in case a device puts the IFLA_LINK_NETNSID attribute in a dump,
always put the IFLA_LINK attribute as well.

Thanks to Dan Winship for analyzing the original OpenShift bug down to
the missing netlink attribute.

v2: change Fixes tag, it's been here forever, as Nicolas Dichtel said
    add Nicolas' ack
v3: change Fixes tag
    fix subject typo, spotted by Edward Cree

Analyzed-by: Dan Winship <danw@redhat.com>
Fixes: d8a5ec672768 ("[NET]: netlink support for moving devices between network namespaces.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/mlx4_core: Change the error print to info print
Yunjian Wang [Tue, 14 May 2019 11:03:19 +0000 (19:03 +0800)]
net/mlx4_core: Change the error print to info print

The error print within mlx4_flow_steer_promisc_add() should
be a info print.

Fixes: 592e49dda812 ('net/mlx4: Implement promiscuous mode with device managed flow-steering')
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoNFC: Orphan the subsystem
Johannes Berg [Tue, 14 May 2019 09:02:31 +0000 (11:02 +0200)]
NFC: Orphan the subsystem

Samuel clearly hasn't been working on this in many years and
patches getting to the wireless list are just being ignored
entirely now. Mark the subsystem as orphan to reflect the
current state and revert back to the netdev list so at least
some fixes can be picked up by Dave.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: Always descend into dsa/
Florian Fainelli [Mon, 13 May 2019 21:06:24 +0000 (14:06 -0700)]
net: Always descend into dsa/

Jiri reported that with a kernel built with CONFIG_FIXED_PHY=y,
CONFIG_NET_DSA=m and CONFIG_NET_DSA_LOOP=m, we would not get to a
functional state where the mock-up driver is registered. Turns out that
we are not descending into drivers/net/dsa/ unconditionally, and we
won't be able to link-in dsa_loop_bdinfo.o which does the actual mock-up
mdio device registration.

Reported-by: Jiri Pirko <jiri@resnulli.us>
Fixes: 40013ff20b1b ("net: dsa: Fix functional dsa-loop dependency on FIXED_PHY")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Tested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: fix retrans timestamp on passive Fast Open
Yuchung Cheng [Mon, 13 May 2019 17:32:05 +0000 (10:32 -0700)]
tcp: fix retrans timestamp on passive Fast Open

Commit c7d13c8faa74 ("tcp: properly track retry time on
passive Fast Open") sets the start of SYNACK retransmission
time on passive Fast Open in "retrans_stamp". However the
timestamp is not reset upon the handshake has completed. As a
result, future data packet retransmission may not update it in
tcp_retransmit_skb(). This may lead to socket aborting earlier
unexpectedly by retransmits_timed_out() since retrans_stamp remains
the SYNACK rtx time.

This bug only manifests on passive TFO sender that a) suffered
SYNACK timeout and then b) stalls on very first loss recovery. Any
successful loss recovery would reset the timestamp to avoid this
issue.

Fixes: c7d13c8faa74 ("tcp: properly track retry time on passive Fast Open")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Tue, 14 May 2019 21:12:59 +0000 (14:12 -0700)]
Merge tag 'for_linus' of git://git./linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:

 - enable packed ring support for s390

 - several fixes

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio/s390: enable packed ring
  virtio/s390: DMA support for virtio-ccw
  virtio/s390: use vring_create_virtqueue
  virtio/virtio_ring: do some comment fixes
  vhost-scsi: remove incorrect memory barrier
  tools/virtio/ringtest: Remove bogus definition of BUG_ON()
  virtio_ring: Fix potential mem leak in virtqueue_add_indirect_packed

5 years agoAdd gitignore file for samples/vfs/ generated files
Linus Torvalds [Tue, 14 May 2019 20:30:10 +0000 (13:30 -0700)]
Add gitignore file for samples/vfs/ generated files

Commit f1b5618e013a ("vfs: Add a sample program for the new mount API")
added sample programs that get built during the kernel build, but then
cause 'git status' to worry about whether the resulting binaries should
be managed by git.

Tell git not to worry, and to ignore the sample binaries.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoMerge branch 'parisc-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller...
Linus Torvalds [Tue, 14 May 2019 20:17:19 +0000 (13:17 -0700)]
Merge branch 'parisc-5.2-2' of git://git./linux/kernel/git/deller/parisc-linux

Pull more parisc updates from Helge Deller:
 "Two small enhancements, which I didn't included in the last pull
  request because I wanted to keep them a few more days in for-next
  before sending upstream:

   - Replace the ldcw barrier instruction by a nop instruction in the
     CAS code on uniprocessor machines.

   - Map variables read-only after init (enable ro_after_init feature)"

* 'parisc-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Use __ro_after_init in init.c
  parisc: Use __ro_after_init in unwind.c
  parisc: Use __ro_after_init in time.c
  parisc: Use __ro_after_init in processor.c
  parisc: Use __ro_after_init in process.c
  parisc: Use __ro_after_init in perf_images.h
  parisc: Use __ro_after_init in pci.c
  parisc: Use __ro_after_init in inventory.c
  parisc: Use __ro_after_init in head.S
  parisc: Use __ro_after_init in firmware.c
  parisc: Use __ro_after_init in drivers.c
  parisc: Use __ro_after_init in cache.c
  parisc: Enable the ro_after_init feature
  parisc: Drop LDCW barrier in CAS code when running UP

5 years agoMerge tag 'kgdb-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt...
Linus Torvalds [Tue, 14 May 2019 20:14:10 +0000 (13:14 -0700)]
Merge tag 'kgdb-5.2-rc1' of git://git./linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "Mostly cleanups but there are also a couple of fixes for out-of-bounds
  accesses (including a potential write to the byte before a static
  buffer).

  The main changes are:

   - Fixes to those out-of-bounds access (empty string to configure test
     module could write the byte before a buffer, high cpu counts could
     read outside of per-cpu structures).

   - Improvements to string handling problems picked up by new compiler
     warnings and other static checks. Most are fixing benign issues
     that can't be tickled without code changes but still reduce the wtf
     factor a little.

   - Tidy up the terminal output"

* tag 'kgdb-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Fix bound check compiler warning
  kdb: do a sanity check on the cpu in kdb_per_cpu()
  kdb: Get rid of broken attempt to print CCVERSION in kdb summary
  misc: kgdbts: fix out-of-bounds access in function param_set_kgdbts_var
  kdb: kdb_support: replace strcpy() by strscpy()
  gdbstub: Replace strcpy() by strscpy()
  gdbstub: mark expected switch fall-throughs

5 years agoMerge tag 'modules-for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu...
Linus Torvalds [Tue, 14 May 2019 17:55:54 +0000 (10:55 -0700)]
Merge tag 'modules-for-v5.2' of git://git./linux/kernel/git/jeyu/linux

Pull modules updates from Jessica Yu:

 - Use a separate table to store symbol types instead of hijacking
   fields in struct Elf_Sym

 - Trivial code cleanups

* tag 'modules-for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: add stubs for within_module functions
  kallsyms: store type information in its own array
  vmlinux.lds.h: drop unused __vermagic

5 years agoMerge branch 'lru-map-fix'
Alexei Starovoitov [Tue, 14 May 2019 17:47:29 +0000 (10:47 -0700)]
Merge branch 'lru-map-fix'

Daniel Borkmann says:

====================
This set fixes LRU map eviction in combination with map lookups out
of system call side from user space. Main patch is the second one and
test cases are adapted and added in the last one. Thanks!
====================

Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: test ref bit from data path and add new tests for syscall path
Daniel Borkmann [Mon, 13 May 2019 23:18:57 +0000 (01:18 +0200)]
bpf: test ref bit from data path and add new tests for syscall path

The test_lru_map is relying on marking the LRU map entry via regular
BPF map lookup from system call side. This is basically for simplicity
reasons. Given we fixed marking entries in that case, the test needs
to be fixed as well. Here we add a small drop-in replacement to retain
existing behavior for the tests by marking out of the BPF program and
transferring the retrieved value out via temporary map. This also adds
new test cases to track the new behavior where two elements are marked,
one via system call side and one via program side, where the next update
then evicts the key looked up only from system call side.

  # ./test_lru_map
  nr_cpus:8

  test_lru_sanity0 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity1 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity2 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity3 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity4 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity5 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity7 (map_type:9 map_flags:0x0): Pass
  test_lru_sanity8 (map_type:9 map_flags:0x0): Pass

  test_lru_sanity0 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity1 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity2 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity3 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity4 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity5 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity7 (map_type:10 map_flags:0x0): Pass
  test_lru_sanity8 (map_type:10 map_flags:0x0): Pass

  test_lru_sanity0 (map_type:9 map_flags:0x2): Pass
  test_lru_sanity4 (map_type:9 map_flags:0x2): Pass
  test_lru_sanity6 (map_type:9 map_flags:0x2): Pass
  test_lru_sanity7 (map_type:9 map_flags:0x2): Pass
  test_lru_sanity8 (map_type:9 map_flags:0x2): Pass

  test_lru_sanity0 (map_type:10 map_flags:0x2): Pass
  test_lru_sanity4 (map_type:10 map_flags:0x2): Pass
  test_lru_sanity6 (map_type:10 map_flags:0x2): Pass
  test_lru_sanity7 (map_type:10 map_flags:0x2): Pass
  test_lru_sanity8 (map_type:10 map_flags:0x2): Pass

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf, lru: avoid messing with eviction heuristics upon syscall lookup
Daniel Borkmann [Mon, 13 May 2019 23:18:56 +0000 (01:18 +0200)]
bpf, lru: avoid messing with eviction heuristics upon syscall lookup

One of the biggest issues we face right now with picking LRU map over
regular hash table is that a map walk out of user space, for example,
to just dump the existing entries or to remove certain ones, will
completely mess up LRU eviction heuristics and wrong entries such
as just created ones will get evicted instead. The reason for this
is that we mark an entry as "in use" via bpf_lru_node_set_ref() from
system call lookup side as well. Thus upon walk, all entries are
being marked, so information of actual least recently used ones
are "lost".

In case of Cilium where it can be used (besides others) as a BPF
based connection tracker, this current behavior causes disruption
upon control plane changes that need to walk the map from user space
to evict certain entries. Discussion result from bpfconf [0] was that
we should simply just remove marking from system call side as no
good use case could be found where it's actually needed there.
Therefore this patch removes marking for regular LRU and per-CPU
flavor. If there ever should be a need in future, the behavior could
be selected via map creation flag, but due to mentioned reason we
avoid this here.

  [0] http://vger.kernel.org/bpfconf.html

Fixes: 29ba732acbee ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
Fixes: 8f8449384ec3 ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agobpf: add map_lookup_elem_sys_only for lookups from syscall side
Daniel Borkmann [Mon, 13 May 2019 23:18:55 +0000 (01:18 +0200)]
bpf: add map_lookup_elem_sys_only for lookups from syscall side

Add a callback map_lookup_elem_sys_only() that map implementations
could use over map_lookup_elem() from system call side in case the
map implementation needs to handle the latter differently than from
the BPF data path. If map_lookup_elem_sys_only() is set, this will
be preferred pick for map lookups out of user space. This hook is
used in a follow-up fix for LRU map, but once development window
opens, we can convert other map types from map_lookup_elem() (here,
the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
the callback to simplify and clean up the latter.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
5 years agoMerge tag 'backlight-next-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee...
Linus Torvalds [Tue, 14 May 2019 17:45:03 +0000 (10:45 -0700)]
Merge tag 'backlight-next-5.2' of git://git./linux/kernel/git/lee/backlight

Pull backlight updates from Lee Jones:
 "Fix-ups:
   - Remove unused BACKLIGHT_LCD_SUPPORT symbol
   - Remove unused BACKLIGHT_CLASS_DEVICE dependencies
   - Add DT support to lm3630a_bl

  Bug Fixes:
   - Fix error path issues in lm3630a_bl"

* tag 'backlight-next-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
  backlight: lm3630a: Add firmware node support
  dt-bindings: backlight: Add lm3630a bindings
  backlight: lm3630a: Return 0 on success in update_status functions
  video: lcd: Remove useless BACKLIGHT_CLASS_DEVICE dependencies
  video: backlight: Remove useless BACKLIGHT_LCD_SUPPORT kernel symbol

5 years agoMerge tag 'mfd-next-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Linus Torvalds [Tue, 14 May 2019 17:39:08 +0000 (10:39 -0700)]
Merge tag 'mfd-next-5.2' of git://git./linux/kernel/git/lee/mfd

Pull MFD updates from Lee Jones:
 "Core Framework:
   - Document (kerneldoc) core mfd_add_devices() API

  New Drivers:
   - Altera SOCFPGA System Manager
   - Maxim MAX77650/77651 PMIC
   - Maxim MAX77663 PMIC
   - ST Multi-Function eXpander (STMFX)

  New Device Support:
   - LEDs support in Intel Cherry Trail Whiskey Cove PMIC
   - RTC support in SAMSUNG Electronics S2MPA01 PMIC
   - SAM9X60 support in Atmel HLCDC (High-end LCD Controller)
   - USB X-Powers AXP 8xx PMICs
   - Integrated Sensor Hub (ISH) in ChromeOS EC
   - USB PD Logger in ChromeOS EC
   - AXP223 in X-Powers AXP series PMICs
   - Power Supply in X-Powers AXP 803 PMICs
   - Comet Lake in Intel Low Power Subsystem
   - Fingerprint MCU in ChromeOS EC
   - Touchpad MCU in ChromeOS EC
   - Move TI LM3532 support to LED

  New Functionality:
   - max77650, max77620: Add/extend DT support
   - max77620 power-off
   - syscon clocking
   - croc_ec host sleep event

  Fix-ups:
   - Trivial; Formatting, spelling, etc; Kconfig, sec-core, ab8500-debugfs
   - Remove unused functionality; rk808, da9063-*
   - SPDX conversion; da9063-*, atmel-*,
   - Adapt/add new register definitions; cs47l35-tables, cs47l90-tables, imx6q-iomuxc-gpr
   - Fix-up DT bindings; ti-lmu, cirrus,lochnagar
   - Simply obtaining driver data; ssbi, t7l66xb, tc6387xb, tc6393xb

  Bug Fixes:
   - Fix incorrect defined values; max77620, da9063
   - Fix device initialisation; twl6040
   - Reset device on init; intel-lpss
   - Fix build warnings when !OF; sun6i-prcm
   - Register OF match tables; tps65912-spi
   - Fix DMI matching; intel_quark_i2c_gpio"

* tag 'mfd-next-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (65 commits)
  mfd: Use dev_get_drvdata() directly
  mfd: cros_ec: Instantiate properly CrOS Touchpad MCU device
  mfd: cros_ec: Instantiate properly CrOS FP MCU device
  mfd: cros_ec: Update the EC feature codes
  mfd: intel-lpss: Add Intel Comet Lake PCI IDs
  mfd: lochnagar: Add links to binding docs for sound and hwmon
  mfd: ab8500-debugfs: Fix a typo ("deubgfs")
  mfd: imx6sx: Add MQS register definition for iomuxc gpr
  dt-bindings: mfd: LMU: Fix lm3632 dt binding example
  mfd: intel_quark_i2c_gpio: Adjust IOT2000 matching
  mfd: da9063: Fix OTP control register names to match datasheets for DA9063/63L
  mfd: tps65912-spi: Add missing of table registration
  mfd: axp20x: Add USB power supply mfd cell to AXP803
  mfd: sun6i-prcm: Fix build warning for non-OF configurations
  mfd: intel-lpss: Set the device in reset state when init
  platform/chrome: Add support for v1 of host sleep event
  mfd: cros_ec: Add host_sleep_event_v1 command
  mfd: cros_ec: Instantiate the CrOS USB PD logger driver
  mfd: cs47l90: Make DAC_AEC_CONTROL_2 readable
  mfd: cs47l35: Make DAC_AEC_CONTROL_2 readable
  ...

5 years agoMerge tag 'pci-v5.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Linus Torvalds [Tue, 14 May 2019 17:30:10 +0000 (10:30 -0700)]
Merge tag 'pci-v5.2-changes' of git://git./linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "Enumeration changes:

   - Add _HPX Type 3 settings support, which gives firmware more
     influence over device configuration (Alexandru Gagniuc)

   - Support fixed bus numbers from bridge Enhanced Allocation
     capabilities (Subbaraya Sundeep)

   - Add "external-facing" DT property to identify cases where we
     require IOMMU protection against untrusted devices (Jean-Philippe
     Brucker)

   - Enable PCIe services for host controller drivers that use managed
     host bridge alloc (Jean-Philippe Brucker)

   - Log PCIe port service messages with pci_dev, not the pcie_device
     (Frederick Lawler)

   - Convert pciehp from pciehp_debug module parameter to generic
     dynamic debug (Frederick Lawler)

  Peer-to-peer DMA:

   - Add whitelist of Root Complexes that support peer-to-peer DMA
     between Root Ports (Christian König)

  Native controller drivers:

   - Add PCI host bridge DMA ranges for bridges that can't DMA
     everywhere, e.g., iProc (Srinath Mannam)

   - Add Amazon Annapurna Labs PCIe host controller driver (Jonathan
     Chocron)

   - Fix Tegra MSI target allocation so DMA doesn't generate unwanted
     MSIs (Vidya Sagar)

   - Fix of_node reference leaks (Wen Yang)

   - Fix Hyper-V module unload & device removal issues (Dexuan Cui)

   - Cleanup R-Car driver (Marek Vasut)

   - Cleanup Keystone driver (Kishon Vijay Abraham I)

   - Cleanup i.MX6 driver (Andrey Smirnov)

  Significant bug fixes:

   - Reset Lenovo ThinkPad P50 GPU so nouveau works after reboot (Lyude
     Paul)

   - Fix Switchtec firmware update performance issue (Wesley Sheng)

   - Work around Pericom switch link retraining erratum (Stefan Mätje)"

* tag 'pci-v5.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (141 commits)
  MAINTAINERS: Add Karthikeyan Mitran and Hou Zhiqiang for Mobiveil PCI
  PCI: pciehp: Remove pointless MY_NAME definition
  PCI: pciehp: Remove pointless PCIE_MODULE_NAME definition
  PCI: pciehp: Remove unused dbg/err/info/warn() wrappers
  PCI: pciehp: Log messages with pci_dev, not pcie_device
  PCI: pciehp: Replace pciehp_debug module param with dyndbg
  PCI: pciehp: Remove pciehp_debug uses
  PCI/AER: Log messages with pci_dev, not pcie_device
  PCI/DPC: Log messages with pci_dev, not pcie_device
  PCI/PME: Replace dev_printk(KERN_DEBUG) with dev_info()
  PCI/AER: Replace dev_printk(KERN_DEBUG) with dev_info()
  PCI: Replace dev_printk(KERN_DEBUG) with dev_info(), etc
  PCI: Replace printk(KERN_INFO) with pr_info(), etc
  PCI: Use dev_printk() when possible
  PCI: Cleanup setup-bus.c comments and whitespace
  PCI: imx6: Allow asynchronous probing
  PCI: dwc: Save root bus for driver remove hooks
  PCI: dwc: Use devm_pci_alloc_host_bridge() to simplify code
  PCI: dwc: Free MSI in dw_pcie_host_init() error path
  PCI: dwc: Free MSI IRQ page in dw_pcie_free_msi()
  ...

5 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Tue, 14 May 2019 17:10:55 +0000 (10:10 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge misc updates from Andrew Morton:

 - a few misc things and hotfixes

 - ocfs2

 - almost all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (139 commits)
  kernel/memremap.c: remove the unused device_private_entry_fault() export
  mm: delete find_get_entries_tag
  mm/huge_memory.c: make __thp_get_unmapped_area static
  mm/mprotect.c: fix compilation warning because of unused 'mm' variable
  mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
  mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
  mm/Kconfig: update "Memory Model" help text
  mm/vmscan.c: don't disable irq again when count pgrefill for memcg
  mm: memblock: make keeping memblock memory opt-in rather than opt-out
  hugetlbfs: always use address space in inode for resv_map pointer
  mm/z3fold.c: support page migration
  mm/z3fold.c: add structure for buddy handles
  mm/z3fold.c: improve compression by extending search
  mm/z3fold.c: introduce helper functions
  mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
  mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
  mm/vmscan.c: simplify shrink_inactive_list()
  fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  xen/privcmd-buf.c: convert to use vm_map_pages_zero()
  xen/gntdev.c: convert to use vm_map_pages()
  ...

5 years agokernel/memremap.c: remove the unused device_private_entry_fault() export
Christoph Hellwig [Tue, 14 May 2019 00:23:23 +0000 (17:23 -0700)]
kernel/memremap.c: remove the unused device_private_entry_fault() export

This export has been entirely unused since it was added more than 1 1/2
years ago.

Link: http://lkml.kernel.org/r/20190429115535.12793-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: delete find_get_entries_tag
Matthew Wilcox (Oracle) [Tue, 14 May 2019 00:23:20 +0000 (17:23 -0700)]
mm: delete find_get_entries_tag

I removed the only user of this and hadn't noticed it was now unused.

Link: http://lkml.kernel.org/r/20190430152929.21813-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/huge_memory.c: make __thp_get_unmapped_area static
Bharath Vedartham [Tue, 14 May 2019 00:23:17 +0000 (17:23 -0700)]
mm/huge_memory.c: make __thp_get_unmapped_area static

__thp_get_unmapped_area is only used in mm/huge_memory.c.  Make it static.
Tested by building and booting the kernel.

Link: http://lkml.kernel.org/r/20190504102353.GA22525@bharath12345-Inspiron-5559
Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/mprotect.c: fix compilation warning because of unused 'mm' variable
Mike Rapoport [Tue, 14 May 2019 00:23:14 +0000 (17:23 -0700)]
mm/mprotect.c: fix compilation warning because of unused 'mm' variable

Since 0cbe3e26abe0 ("mm: update ptep_modify_prot_start/commit to take
vm_area_struct as arg") the only place that uses the local 'mm' variable
in change_pte_range() is the call to set_pte_at().

Many architectures define set_pte_at() as macro that does not use the 'mm'
parameter, which generates the following compilation warning:

 CC      mm/mprotect.o
mm/mprotect.c: In function 'change_pte_range':
mm/mprotect.c:42:20: warning: unused variable 'mm' [-Wunused-variable]
  struct mm_struct *mm = vma->vm_mm;
                    ^~

Fix it by passing vma->mm to set_pte_at() and dropping the local 'mm'
variable in change_pte_range().

[liu.song.a23@gmail.com: fix missed conversions]
Link: http://lkml.kernel.org/r/CAPhsuW6wcQgYLHNdBdw6m0YiR4RWsS4XzfpSKU7wBLLeOCTbpw@mail.gmail.comLink:
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Song Liu <liu.song.a23@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/page-writeback: introduce tracepoint for wait_on_page_writeback()
Yafang Shao [Tue, 14 May 2019 00:23:11 +0000 (17:23 -0700)]
mm/page-writeback: introduce tracepoint for wait_on_page_writeback()

Recently there have been some hung tasks on our server due to
wait_on_page_writeback(), and we want to know the details of this
PG_writeback, i.e.  this page is writing back to which device.  But it is
not so convenient to get the details.

I think it would be better to introduce a tracepoint for diagnosing the
writeback details.

Link: http://lkml.kernel.org/r/1556274402-19018-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
Yafang Shao [Tue, 14 May 2019 00:23:08 +0000 (17:23 -0700)]
mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags

trace_reclaim_flags and trace_shrink_flags are almost the same.
We can simplify them to avoid redundant code.

Link: http://lkml.kernel.org/r/1556169203-5858-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/Kconfig: update "Memory Model" help text
Mike Rapoport [Tue, 14 May 2019 00:23:05 +0000 (17:23 -0700)]
mm/Kconfig: update "Memory Model" help text

The help describing the memory model selection is outdated.  It still says
that SPARSEMEM is experimental and DISCONTIGMEM is a preferred over
SPARSEMEM.

Update the help text for the relevant options:
* add a generic help for the "Memory Model" prompt
* add description for FLATMEM
* reduce the description of DISCONTIGMEM and add a deprecation note
* prefer SPARSEMEM over DISCONTIGMEM

Link: http://lkml.kernel.org/r/1556188531-20728-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/vmscan.c: don't disable irq again when count pgrefill for memcg
Yafang Shao [Tue, 14 May 2019 00:23:02 +0000 (17:23 -0700)]
mm/vmscan.c: don't disable irq again when count pgrefill for memcg

We can use __count_memcg_events() directly because this callsite is alreay
protected by spin_lock_irq().

Link: http://lkml.kernel.org/r/1556093494-30798-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: memblock: make keeping memblock memory opt-in rather than opt-out
Mike Rapoport [Tue, 14 May 2019 00:22:59 +0000 (17:22 -0700)]
mm: memblock: make keeping memblock memory opt-in rather than opt-out

Most architectures do not need the memblock memory after the page
allocator is initialized, but only few enable ARCH_DISCARD_MEMBLOCK in the
arch Kconfig.

Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
logic makes it clear which architectures actually use memblock after
system initialization and skips the necessity to add ARCH_DISCARD_MEMBLOCK
to the architectures that are still missing that option.

Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agohugetlbfs: always use address space in inode for resv_map pointer
Mike Kravetz [Tue, 14 May 2019 00:22:55 +0000 (17:22 -0700)]
hugetlbfs: always use address space in inode for resv_map pointer

Continuing discussion about 58b6e5e8f1ad ("hugetlbfs: fix memory leak for
resv_map") brought up the issue that inode->i_mapping may not point to the
address space embedded within the inode at inode eviction time.  The
hugetlbfs truncate routine handles this by explicitly using inode->i_data.
However, code cleaning up the resv_map will still use the address space
pointed to by inode->i_mapping.  Luckily, private_data is NULL for address
spaces in all such cases today but, there is no guarantee this will
continue.

Change all hugetlbfs code getting a resv_map pointer to explicitly get it
from the address space embedded within the inode.  In addition, add more
comments in the code to indicate why this is being done.

Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Yufen Yu <yuyufen@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/z3fold.c: support page migration
Vitaly Wool [Tue, 14 May 2019 00:22:52 +0000 (17:22 -0700)]
mm/z3fold.c: support page migration

Now that we are not using page address in handles directly, we can make
z3fold pages movable to decrease the memory fragmentation z3fold may
create over time.

This patch starts advertising non-headless z3fold pages as movable and
uses the existing kernel infrastructure to implement moving of such pages
per memory management subsystem's request.  It thus implements 3 required
callbacks for page migration:

* isolation callback: z3fold_page_isolate(): try to isolate the page by
  removing it from all lists.  Pages scheduled for some activity and
  mapped pages will not be isolated.  Return true if isolation was
  successful or false otherwise

* migration callback: z3fold_page_migrate(): re-check critical
  conditions and migrate page contents to the new page provided by the
  memory subsystem.  Returns 0 on success or negative error code otherwise

* putback callback: z3fold_page_putback(): put back the page if
  z3fold_page_migrate() for it failed permanently (i.  e.  not with
  -EAGAIN code).

[lkp@intel.com: z3fold_page_isolate() can be static]
Link: http://lkml.kernel.org/r/20190419130924.GA161478@ivb42
Link: http://lkml.kernel.org/r/20190417103922.31253da5c366c4ebe0419cfc@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Signed-off-by: kbuild test robot <lkp@intel.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/z3fold.c: add structure for buddy handles
Vitaly Wool [Tue, 14 May 2019 00:22:49 +0000 (17:22 -0700)]
mm/z3fold.c: add structure for buddy handles

For z3fold to be able to move its pages per request of the memory
subsystem, it should not use direct object addresses in handles.  Instead,
it will create abstract handles (3 per page) which will contain pointers
to z3fold objects.  Thus, it will be possible to change these pointers
when z3fold page is moved.

Link: http://lkml.kernel.org/r/20190417103826.484eaf18c1294d682769880f@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/z3fold.c: improve compression by extending search
Vitaly Wool [Tue, 14 May 2019 00:22:46 +0000 (17:22 -0700)]
mm/z3fold.c: improve compression by extending search

The current z3fold implementation only searches this CPU's page lists for
a fitting page to put a new object into.  This patch adds quick search for
very well fitting pages (i.  e.  those having exactly the required number
of free space) on other CPUs too, before allocating a new page for that
object.

Link: http://lkml.kernel.org/r/20190417103733.72ae81abe1552397c95a008e@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/z3fold.c: introduce helper functions
Vitaly Wool [Tue, 14 May 2019 00:22:43 +0000 (17:22 -0700)]
mm/z3fold.c: introduce helper functions

Patch series "z3fold: support page migration", v2.

This patchset implements page migration support and slightly better buddy
search.  To implement page migration support, z3fold has to move away from
the current scheme of handle encoding.  i.  e.  stop encoding page address
in handles.  Instead, a small per-page structure is created which will
contain actual addresses for z3fold objects, while pointers to fields of
that structure will be used as handles.

Thus, it will be possible to change the underlying addresses to reflect
page migration.

To support migration itself, 3 callbacks will be implemented:

1: isolation callback: z3fold_page_isolate(): try to isolate the page
   by removing it from all lists.  Pages scheduled for some activity and
   mapped pages will not be isolated.  Return true if isolation was
   successful or false otherwise

2: migration callback: z3fold_page_migrate(): re-check critical
   conditions and migrate page contents to the new page provided by the
   system.  Returns 0 on success or negative error code otherwise

3: putback callback: z3fold_page_putback(): put back the page if
   z3fold_page_migrate() for it failed permanently (i.  e.  not with
   -EAGAIN code).

To make sure an isolated page doesn't get freed, its kref is incremented
in z3fold_page_isolate() and decremented during post-migration compaction,
if migration was successful, or by z3fold_page_putback() in the other
case.

Since the new handle encoding scheme implies slight memory consumption
increase, better buddy search (which decreases memory consumption) is
included in this patchset.

This patch (of 4):

Introduce a separate helper function for object allocation, as well as 2
smaller helpers to add a buddy to the list and to get a pointer to the
pool from the z3fold header.  No functional changes here.

Link: http://lkml.kernel.org/r/20190417103633.a4bb770b5bf0fb7e43ce1666@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
Yafang Shao [Tue, 14 May 2019 00:22:40 +0000 (17:22 -0700)]
mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist

Because rmqueue_pcplist() is only called when order is 0, we don't need to
use order as a parameter.

Link: http://lkml.kernel.org/r/1555591709-11744-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
Jérôme Glisse [Tue, 14 May 2019 00:22:37 +0000 (17:22 -0700)]
mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig

Add 2 new Kconfig variables that are not used by anyone.  I check that
various make ARCH=somearch allmodconfig do work and do not complain.  This
new Kconfig needs to be added first so that device drivers that depend on
HMM can be updated.

Once drivers are updated then I can update the HMM Kconfig to depend on
this new Kconfig in a followup patch.

This is about solving Kconfig for HMM given that device driver are
going through their own tree we want to avoid changing them from the mm
tree.  So plan is:

1 - Kernel release N add the new Kconfig to mm/Kconfig (this patch)
2 - Kernel release N+1 update driver to depend on new Kconfig ie
    stop using ARCH_HASH_HMM and start using ARCH_HAS_HMM_MIRROR
    and ARCH_HAS_HMM_DEVICE (one or the other or both depending
    on the driver)
3 - Kernel release N+2 remove ARCH_HASH_HMM and do final Kconfig
    update in mm/Kconfig

Link: http://lkml.kernel.org/r/20190417211141.17580-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/vmscan.c: simplify shrink_inactive_list()
Kirill Tkhai [Tue, 14 May 2019 00:22:33 +0000 (17:22 -0700)]
mm/vmscan.c: simplify shrink_inactive_list()

This merges together duplicated patterns of code.  Also, replace
count_memcg_events() with its irq-careless namesake, because they are
already called in interrupts disabled context.

Link: http://lkml.kernel.org/r/2ece1df4-2989-bc9b-6172-61e9fdde5bfd@virtuozzo.com
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
Amir Goldstein [Tue, 14 May 2019 00:22:30 +0000 (17:22 -0700)]
fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback

23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
writeback") claims that sync_file_range(2) syscall was "created for
userspace to be able to issue background writeout and so waiting for
in-flight IO is undesirable there" and changes the writeback (back) to
WB_SYNC_NONE.

This claim is only partially true.  It is true for users that use the flag
SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was the
reason for changing to WB_SYNC_NONE writeback.

However, that claim is not true for users that use that flag combination
SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.  Those users explicitly
requested to wait for in-flight IO as well as to writeback of dirty pages.

Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
WB_SYNC_ALL writeback to perform the full range sync request.

Link: http://lkml.kernel.org/r/20190409114922.30095-1-amir73il@gmail.com
Link: http://lkml.kernel.org/r/20190419072938.31320-1-amir73il@gmail.com
Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoxen/privcmd-buf.c: convert to use vm_map_pages_zero()
Souptick Joarder [Tue, 14 May 2019 00:22:27 +0000 (17:22 -0700)]
xen/privcmd-buf.c: convert to use vm_map_pages_zero()

Convert to use vm_map_pages_zero() to map range of kernel memory to user
vma.

This driver has ignored vm_pgoff.  We could later "fix" these drivers to
behave according to the normal vm_pgoff offsetting simply by removing the
_zero suffix on the function name and if that causes regressions, it gives
us an easy way to revert.

Link: http://lkml.kernel.org/r/acf678e81d554d01a9b590716ac0ccbdcdf71c25.1552921225.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoxen/gntdev.c: convert to use vm_map_pages()
Souptick Joarder [Tue, 14 May 2019 00:22:23 +0000 (17:22 -0700)]
xen/gntdev.c: convert to use vm_map_pages()

Convert to use vm_map_pages() to map range of kernel memory to user vma.

map->count is passed to vm_map_pages() and internal API verify map->count
against count ( count = vma_pages(vma)) for page array boundary overrun
condition.

Link: http://lkml.kernel.org/r/88e56e82d2db98705c2d842e9c9806c00b366d67.1552921225.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agovideobuf2/videobuf2-dma-sg.c: convert to use vm_map_pages()
Souptick Joarder [Tue, 14 May 2019 00:22:19 +0000 (17:22 -0700)]
videobuf2/videobuf2-dma-sg.c: convert to use vm_map_pages()

Convert to use vm_map_pages() to map range of kernel memory to user vma.

vm_pgoff is treated in V4L2 API as a 'cookie' to select a buffer, not as a
in-buffer offset by design and it always want to mmap a whole buffer from
its beginning.

Link: http://lkml.kernel.org/r/a953fe6b3056de1cc6eab654effdd4a22f125375.1552921225.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Suggested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoiommu/dma-iommu.c: convert to use vm_map_pages()
Souptick Joarder [Tue, 14 May 2019 00:22:15 +0000 (17:22 -0700)]
iommu/dma-iommu.c: convert to use vm_map_pages()

Convert to use vm_map_pages() to map range of kernel memory to user vma.

Link: http://lkml.kernel.org/r/80c3d220fc6ada73a88ce43ca049afb55a889258.1552921225.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agodrm/xen/xen_drm_front_gem.c: convert to use vm_map_pages()
Souptick Joarder [Tue, 14 May 2019 00:22:11 +0000 (17:22 -0700)]
drm/xen/xen_drm_front_gem.c: convert to use vm_map_pages()

Convert to use vm_map_pages() to map range of kernel memory to user vma.

Link: http://lkml.kernel.org/r/ff8e10ba778d79419c66ee8215bccf01560540fd.1552921225.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agodrm/rockchip/rockchip_drm_gem.c: convert to use vm_map_pages()
Souptick Joarder [Tue, 14 May 2019 00:22:07 +0000 (17:22 -0700)]
drm/rockchip/rockchip_drm_gem.c: convert to use vm_map_pages()

Convert to use vm_map_pages() to map range of kernel memory to user vma.

Tested on Rockchip hardware and display is working, including talking to
Lima via prime.

Link: http://lkml.kernel.org/r/7ba359eb1aceac388d05983c1f29b915bdf291f9.1552921225.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Tested-by: Heiko Stuebner <heiko@sntech.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agodrivers/firewire/core-iso.c: convert to use vm_map_pages_zero()
Souptick Joarder [Tue, 14 May 2019 00:22:03 +0000 (17:22 -0700)]
drivers/firewire/core-iso.c: convert to use vm_map_pages_zero()

Convert to use vm_map_pages_zero() to map range of kernel memory to user
vma.

This driver has ignored vm_pgoff and mapped the entire pages.  We could
later "fix" these drivers to behave according to the normal vm_pgoff
offsetting simply by removing the _zero suffix on the function name and if
that causes regressions, it gives us an easy way to revert.

Link: http://lkml.kernel.org/r/88645f5ea8202784a8baaf389e592aeb8c505e8e.1552921225.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoarm: mm: dma-mapping: convert to use vm_map_pages()
Souptick Joarder [Tue, 14 May 2019 00:22:00 +0000 (17:22 -0700)]
arm: mm: dma-mapping: convert to use vm_map_pages()

Convert to use vm_map_pages() to map range of kernel memory to user vma.

Link: http://lkml.kernel.org/r/936e5e107c746a7310e3a3c471188ca3ac8f9754.1552921225.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Heiko Stuebner <heiko@sntech.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: introduce new vm_map_pages() and vm_map_pages_zero() API
Souptick Joarder [Tue, 14 May 2019 00:21:56 +0000 (17:21 -0700)]
mm: introduce new vm_map_pages() and vm_map_pages_zero() API

Patch series "mm: Use vm_map_pages() and vm_map_pages_zero() API", v5.

This patch (of 5):

Previouly drivers have their own way of mapping range of kernel
pages/memory into user vma and this was done by invoking vm_insert_page()
within a loop.

As this pattern is common across different drivers, it can be generalized
by creating new functions and using them across the drivers.

vm_map_pages() is the API which can be used to map kernel memory/pages in
drivers which have considered vm_pgoff

vm_map_pages_zero() is the API which can be used to map a range of kernel
memory/pages in drivers which have not considered vm_pgoff.  vm_pgoff is
passed as default 0 for those drivers.

We _could_ then at a later "fix" these drivers which are using
vm_map_pages_zero() to behave according to the normal vm_pgoff offsetting
simply by removing the _zero suffix on the function name and if that
causes regressions, it gives us an easy way to revert.

Tested on Rockchip hardware and display is working, including talking to
Lima via prime.

Link: http://lkml.kernel.org/r/751cb8a0f4c3e67e95c58a3b072937617f338eea.1552921225.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Suggested-by: Russell King <linux@armlinux.org.uk>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Heiko Stuebner <heiko@sntech.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: remove redundant 'default n' from Kconfig-s
Bartlomiej Zolnierkiewicz [Tue, 14 May 2019 00:21:53 +0000 (17:21 -0700)]
mm: remove redundant 'default n' from Kconfig-s

'default n' is the default value for any bool or tristate Kconfig
setting so there is no need to write it explicitly.

Also since commit f467c5640c29 ("kconfig: only write '# CONFIG_FOO
is not set' for visible symbols") the Kconfig behavior is the same
regardless of 'default n' being present or not:

    ...
    One side effect of (and the main motivation for) this change is making
    the following two definitions behave exactly the same:

        config FOO
                bool

        config FOO
                bool
                default n

    With this change, neither of these will generate a
    '# CONFIG_FOO is not set' line (assuming FOO isn't selected/implied).
    That might make it clearer to people that a bare 'default n' is
    redundant.
    ...

Link: http://lkml.kernel.org/r/c3385916-e4d4-37d3-b330-e6b7dff83a52@samsung.com
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: fix false-positive OVERCOMMIT_GUESS failures
Johannes Weiner [Tue, 14 May 2019 00:21:50 +0000 (17:21 -0700)]
mm: fix false-positive OVERCOMMIT_GUESS failures

With the default overcommit==guess we occasionally run into mmap
rejections despite plenty of memory that would get dropped under
pressure but just isn't accounted reclaimable. One example of this is
dying cgroups pinned by some page cache. A previous case was auxiliary
path name memory associated with dentries; we have since annotated
those allocations to avoid overcommit failures (see d79f7aa496fc ("mm:
treat indirectly reclaimable memory as free in overcommit logic")).

But trying to classify all allocated memory reliably as reclaimable
and unreclaimable is a bit of a fool's errand. There could be a myriad
of dependencies that constantly change with kernel versions.

It becomes even more questionable of an effort when considering how
this estimate of available memory is used: it's not compared to the
system-wide allocated virtual memory in any way. It's not even
compared to the allocating process's address space. It's compared to
the single allocation request at hand!

So we have an elaborate left-hand side of the equation that tries to
assess the exact breathing room the system has available down to a
page - and then compare it to an isolated allocation request with no
additional context. We could fail an allocation of N bytes, but for
two allocations of N/2 bytes we'd do this elaborate dance twice in a
row and then still let N bytes of virtual memory through. This doesn't
make a whole lot of sense.

Let's take a step back and look at the actual goal of the
heuristic. From the documentation:

   Heuristic overcommit handling. Obvious overcommits of address
   space are refused. Used for a typical system. It ensures a
   seriously wild allocation fails while allowing overcommit to
   reduce swap usage.  root is allowed to allocate slightly more
   memory in this mode. This is the default.

If all we want to do is catch clearly bogus allocation requests
irrespective of the general virtual memory situation, the physical
memory counter-part doesn't need to be that complicated, either.

When in GUESS mode, catch wild allocations by comparing their request
size to total amount of ram and swap in the system.

Link: http://lkml.kernel.org/r/20190412191418.26333-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memory_hotplug: make __remove_pages() and arch_remove_memory() never fail
David Hildenbrand [Tue, 14 May 2019 00:21:46 +0000 (17:21 -0700)]
mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never fail

All callers of arch_remove_memory() ignore errors.  And we should really
try to remove any errors from the memory removal path.  No more errors are
reported from __remove_pages().  BUG() in s390x code in case
arch_remove_memory() is triggered.  We may implement that properly later.
WARN in case powerpc code failed to remove the section mapping, which is
better than ignoring the error completely right now.

Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memory_hotplug: make __remove_section() never fail
David Hildenbrand [Tue, 14 May 2019 00:21:41 +0000 (17:21 -0700)]
mm/memory_hotplug: make __remove_section() never fail

Let's just warn in case a section is not valid instead of failing to
remove somewhere in the middle of the process, returning an error that
will be mostly ignored by callers.

Link: http://lkml.kernel.org/r/20190409100148.24703-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memory_hotplug: make unregister_memory_section() never fail
David Hildenbrand [Tue, 14 May 2019 00:21:37 +0000 (17:21 -0700)]
mm/memory_hotplug: make unregister_memory_section() never fail

Failing while removing memory is mostly ignored and cannot really be
handled.  Let's treat errors in unregister_memory_section() in a nice way,
warning, but continuing.

Link: http://lkml.kernel.org/r/20190409100148.24703-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Mike Travis <mike.travis@hpe.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/memory_hotplug: release memory resource after arch_remove_memory()
David Hildenbrand [Tue, 14 May 2019 00:21:33 +0000 (17:21 -0700)]
mm/memory_hotplug: release memory resource after arch_remove_memory()

Patch series "mm/memory_hotplug: Better error handling when removing
memory", v1.

Error handling when removing memory is somewhat messed up right now.  Some
errors result in warnings, others are completely ignored.  Memory unplug
code can essentially not deal with errors properly as of now.
remove_memory() will never fail.

We have basically two choices:
1. Allow arch_remov_memory() and friends to fail, propagating errors via
   remove_memory(). Might be problematic (e.g. DIMMs consisting of multiple
   pieces added/removed separately).
2. Don't allow the functions to fail, handling errors in a nicer way.

It seems like most errors that can theoretically happen are really corner
cases and mostly theoretical (e.g.  "section not valid").  However e.g.
aborting removal of sections while all callers simply continue in case of
errors is not nice.

If we can gurantee that removal of memory always works (and WARN/skip in
case of theoretical errors so we can figure out what is going on), we can
go ahead and implement better error handling when adding memory.

E.g. via add_memory():

arch_add_memory()
ret = do_stuff()
if (ret) {
arch_remove_memory();
goto error;
}

Handling here that arch_remove_memory() might fail is basically
impossible.  So I suggest, let's avoid reporting errors while removing
memory, warning on theoretical errors instead and continuing instead of
aborting.

This patch (of 4):

__add_pages() doesn't add the memory resource, so __remove_pages()
shouldn't remove it.  Let's factor it out.  Especially as it is a special
case for memory used as system memory, added via add_memory() and friends.

We now remove the resource after removing the sections instead of doing it
the other way around.  I don't think this change is problematic.

add_memory()
register memory resource
arch_add_memory()

remove_memory
arch_remove_memory()
release memory resource

While at it, explain why we ignore errors and that it only happeny if
we remove memory in a different granularity as we added it.

[david@redhat.com: fix printk warning]
Link: http://lkml.kernel.org/r/20190417120204.6997-1-david@redhat.com
Link: http://lkml.kernel.org/r/20190409100148.24703-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Andrew Banman <andrew.banman@hpe.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/filemap.c: fix minor typo
Laurent Dufour [Tue, 14 May 2019 00:21:29 +0000 (17:21 -0700)]
mm/filemap.c: fix minor typo

Link: http://lkml.kernel.org/r/20190304155240.19215-1-ldufour@linux.ibm.com
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>