David S. Miller [Sat, 2 Jun 2018 12:07:52 +0000 (08:07 -0400)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2018-06-02
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) BPF uapi fix in struct bpf_prog_info and struct bpf_map_info in
order to fix offsets on 32 bit archs.
This will have a minor merge conflict with net-next which has the
__u32 gpl_compatible:1 bitfield in struct bpf_prog_info at this
location. Resolution is to use the gpl_compatible member.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Sat, 2 Jun 2018 03:21:59 +0000 (05:21 +0200)]
bpf: fix uapi hole for 32 bit compat applications
In 64 bit, we have a 4 byte hole between ifindex and netns_dev in the
case of struct bpf_map_info but also struct bpf_prog_info. In net-next
commit
b85fab0e67b ("bpf: Add gpl_compatible flag to struct bpf_prog_info")
added a bitfield into it to expose some flags related to programs. Thus,
add an unnamed __u32 bitfield for both so that alignment keeps the same
in both 32 and 64 bit cases, and can be naturally extended from there
as in
b85fab0e67b.
Before:
# file test.o
test.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
# pahole test.o
struct bpf_map_info {
__u32 type; /* 0 4 */
__u32 id; /* 4 4 */
__u32 key_size; /* 8 4 */
__u32 value_size; /* 12 4 */
__u32 max_entries; /* 16 4 */
__u32 map_flags; /* 20 4 */
char name[16]; /* 24 16 */
__u32 ifindex; /* 40 4 */
__u64 netns_dev; /* 44 8 */
__u64 netns_ino; /* 52 8 */
/* size: 64, cachelines: 1, members: 10 */
/* padding: 4 */
};
After (same as on 64 bit):
# file test.o
test.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
# pahole test.o
struct bpf_map_info {
__u32 type; /* 0 4 */
__u32 id; /* 4 4 */
__u32 key_size; /* 8 4 */
__u32 value_size; /* 12 4 */
__u32 max_entries; /* 16 4 */
__u32 map_flags; /* 20 4 */
char name[16]; /* 24 16 */
__u32 ifindex; /* 40 4 */
/* XXX 4 bytes hole, try to pack */
__u64 netns_dev; /* 48 8 */
__u64 netns_ino; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
/* size: 64, cachelines: 1, members: 10 */
/* sum members: 60, holes: 1, sum holes: 4 */
};
Reported-by: Dmitry V. Levin <ldv@altlinux.org>
Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Fixes: 52775b33bb507 ("bpf: offload: report device information about offloaded maps")
Fixes: 675fc275a3a2d ("bpf: offload: report device information for offloaded programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniele Palmas [Thu, 31 May 2018 09:18:29 +0000 (11:18 +0200)]
net: usb: cdc_mbim: add flag FLAG_SEND_ZLP
Testing Telit LM940 with ICMP packets > 14552 bytes revealed that
the modem needs FLAG_SEND_ZLP to properly work, otherwise the cdc
mbim data interface won't be anymore responsive.
Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Jun 2018 17:56:31 +0000 (13:56 -0400)]
Merge branch 'tunnel-mtus'
Nicolas Dichtel says:
====================
ip[6] tunnels: fix mtu calculations
The first patch restores the possibility to bind an ip4 tunnel to an
interface whith a large mtu.
The second patch was spotted after the first fix. I also target it to net
because it fixes the max mtu value that can be used for ipv6 tunnels.
v2: remove the 0xfff8 in ip_tunnel_newlink()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Thu, 31 May 2018 08:59:33 +0000 (10:59 +0200)]
ip6_tunnel: remove magic mtu value 0xFFF8
I don't know where this value comes from (probably a copy and paste and
paste and paste ...).
Let's use standard values which are a bit greater.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/davem/netdev-vger-cvs.git/commit/?id=e5afd356a411a
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Thu, 31 May 2018 08:59:32 +0000 (10:59 +0200)]
ip_tunnel: restore binding to ifaces with a large mtu
After commit
f6cc9c054e77, the following conf is broken (note that the
default loopback mtu is 65536, ie IP_MAX_MTU + 1):
$ ip tunnel add gre1 mode gre local 10.125.0.1 remote 10.125.0.2 dev lo
add tunnel "gre0" failed: Invalid argument
$ ip l a type dummy
$ ip l s dummy1 up
$ ip l s dummy1 mtu 65535
$ ip tunnel add gre1 mode gre local 10.125.0.1 remote 10.125.0.2 dev dummy1
add tunnel "gre0" failed: Invalid argument
dev_set_mtu() doesn't allow to set a mtu which is too large.
First, let's cap the mtu returned by ip_tunnel_bind_dev(). Second, remove
the magic value 0xFFF8 and use IP_MAX_MTU instead.
0xFFF8 seems to be there for ages, I don't know why this value was used.
With a recent kernel, it's also possible to set a mtu > IP_MAX_MTU:
$ ip l s dummy1 mtu 66000
After that patch, it's also possible to bind an ip tunnel on that kind of
interface.
CC: Petr Machata <petrm@mellanox.com>
CC: Ido Schimmel <idosch@mellanox.com>
Link: https://git.kernel.org/pub/scm/linux/kernel/git/davem/netdev-vger-cvs.git/commit/?id=e5afd356a411a
Fixes: f6cc9c054e77 ("ip_tunnel: Emit events for post-register MTU changes")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 1 Jun 2018 17:25:41 +0000 (13:25 -0400)]
Merge branch 'master' of git://git./linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2018-05-31
1) Avoid possible overflow of the offset variable
in _decode_session6(), this fixes an infinite
lookp there. From Eric Dumazet.
2) We may use an error pointer in the error path of
xfrm_bundle_create(). Fix this by returning this
pointer directly to the caller.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Damien Thébault [Thu, 31 May 2018 07:04:01 +0000 (07:04 +0000)]
net: dsa: b53: Add BCM5389 support
This patch adds support for the BCM5389 switch connected through MDIO.
Signed-off-by: Damien Thébault <damien.thebault@vitec.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kirill Tkhai [Fri, 1 Jun 2018 11:30:38 +0000 (14:30 +0300)]
kcm: Fix use-after-free caused by clonned sockets
(resend for properly queueing in patchwork)
kcm_clone() creates kernel socket, which does not take net counter.
Thus, the net may die before the socket is completely destructed,
i.e. kcm_exit_net() is executed before kcm_done().
Reported-by: syzbot+5f1a04e374a635efc426@syzkaller.appspotmail.com
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 31 May 2018 19:59:46 +0000 (15:59 -0400)]
net-sysfs: Fix memory leak in XPS configuration
This patch reorders the error cases in showing the XPS configuration so
that we hold off on memory allocation until after we have verified that we
can support XPS on a given ring.
Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ondřej Hlavatý [Thu, 31 May 2018 21:21:04 +0000 (23:21 +0200)]
ixgbe: fix parsing of TC actions for HW offload
The previous code was optimistic, accepting the offload of whole action
chain when there was a single known action (drop/redirect). This results
in offloading a rule which should not be offloaded, because its behavior
cannot be reproduced in the hardware.
For example:
$ tc filter add dev eno1 parent ffff: protocol ip \
u32 ht 800: order 1 match tcp src 42 FFFF \
action mirred egress mirror dev enp1s16 pipe \
drop
The controller is unable to mirror the packet to a VF, but still
offloads the rule by dropping the packet.
Change the approach of the function to a pessimistic one, rejecting the
chain when an unknown action is found. This is better suited for future
extensions.
Note that both recognized actions always return TC_ACT_SHOT, therefore
it is safe to ignore actions behind them.
Signed-off-by: Ondřej Hlavatý <ohlavaty@redhat.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Thu, 31 May 2018 06:44:49 +0000 (09:44 +0300)]
net: ethernet: davinci_emac: fix error handling in probe()
The current error handling code has an issue where it does:
if (priv->txchan)
cpdma_chan_destroy(priv->txchan);
The problem is that ->txchan is either valid or an error pointer (which
would lead to an Oops). I've changed it to use multiple error labels so
that the test can be removed.
Also there were some missing calls to netif_napi_del().
Fixes: 3ef0fdb2342c ("net: davinci_emac: switch to new cpdma layer")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Samuel Mendoza-Jonas [Thu, 31 May 2018 04:10:04 +0000 (14:10 +1000)]
net/ncsi: Fix array size in dumpit handler
With CONFIG_CC_STACKPROTECTOR enabled the kernel panics as below when
parsing a NCSI_CMD_PKG_INFO command:
[ 150.149711] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in:
805cff08
[ 150.149711]
[ 150.159919] CPU: 0 PID: 1301 Comm: ncsi-netlink Not tainted 4.13.16-
468cbec6d2c91239332cb91b1f0a73aafcb6f0c6 #1
[ 150.170004] Hardware name: Generic DT based system
[ 150.174852] [<
80109930>] (unwind_backtrace) from [<
80106bc4>] (show_stack+0x20/0x24)
[ 150.182641] [<
80106bc4>] (show_stack) from [<
805d36e4>] (dump_stack+0x20/0x28)
[ 150.189888] [<
805d36e4>] (dump_stack) from [<
801163ac>] (panic+0xdc/0x278)
[ 150.196780] [<
801163ac>] (panic) from [<
801162cc>] (__stack_chk_fail+0x20/0x24)
[ 150.204111] [<
801162cc>] (__stack_chk_fail) from [<
805cff08>] (ncsi_pkg_info_all_nl+0x244/0x258)
[ 150.212912] [<
805cff08>] (ncsi_pkg_info_all_nl) from [<
804f939c>] (genl_lock_dumpit+0x3c/0x54)
[ 150.221535] [<
804f939c>] (genl_lock_dumpit) from [<
804f873c>] (netlink_dump+0xf8/0x284)
[ 150.229550] [<
804f873c>] (netlink_dump) from [<
804f8d44>] (__netlink_dump_start+0x124/0x17c)
[ 150.237992] [<
804f8d44>] (__netlink_dump_start) from [<
804f9880>] (genl_rcv_msg+0x1c8/0x3d4)
[ 150.246440] [<
804f9880>] (genl_rcv_msg) from [<
804f9174>] (netlink_rcv_skb+0xd8/0x134)
[ 150.254361] [<
804f9174>] (netlink_rcv_skb) from [<
804f96a4>] (genl_rcv+0x30/0x44)
[ 150.261850] [<
804f96a4>] (genl_rcv) from [<
804f7790>] (netlink_unicast+0x198/0x234)
[ 150.269511] [<
804f7790>] (netlink_unicast) from [<
804f7ffc>] (netlink_sendmsg+0x368/0x3b0)
[ 150.277783] [<
804f7ffc>] (netlink_sendmsg) from [<
804abea4>] (sock_sendmsg+0x24/0x34)
[ 150.285625] [<
804abea4>] (sock_sendmsg) from [<
804ac1dc>] (___sys_sendmsg+0x244/0x260)
[ 150.293556] [<
804ac1dc>] (___sys_sendmsg) from [<
804ad98c>] (__sys_sendmsg+0x5c/0x9c)
[ 150.301400] [<
804ad98c>] (__sys_sendmsg) from [<
804ad9e4>] (SyS_sendmsg+0x18/0x1c)
[ 150.308984] [<
804ad9e4>] (SyS_sendmsg) from [<
80102640>] (ret_fast_syscall+0x0/0x3c)
[ 150.316743] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in:
805cff08
This turns out to be because the attrs array in ncsi_pkg_info_all_nl()
is initialised to a length of NCSI_ATTR_MAX which is the maximum
attribute number, not the number of attributes.
Fixes: 955dc68cb9b2 ("net/ncsi: Add generic netlink family")
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 31 May 2018 19:27:39 +0000 (15:27 -0400)]
Merge tag 'wireless-drivers-for-davem-2018-05-30' of git://git./linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
wireless-drivers fixes for 4.17
Two last minute fixes, hopefully they make it to 4.17 still.
rt2x00
* revert a fix which caused even more problems
iwlwifi
* fix a crash when there are 16 or more logical CPUs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Blakey [Wed, 30 May 2018 08:29:15 +0000 (11:29 +0300)]
cls_flower: Fix incorrect idr release when failing to modify rule
When we fail to modify a rule, we incorrectly release the idr handle
of the unmodified old rule.
Fix that by checking if we need to release it.
Fixes: fe2502e49b58 ("net_sched: remove cls_flower idr on failure")
Reported-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Finn Thain [Wed, 30 May 2018 03:03:51 +0000 (13:03 +1000)]
net/sonic: Use dma_mapping_error()
With CONFIG_DMA_API_DEBUG=y, calling sonic_open() produces the
message, "DMA-API: device driver failed to check map error".
Add the missing dma_mapping_error() call.
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert [Thu, 31 May 2018 07:45:18 +0000 (09:45 +0200)]
xfrm Fix potential error pointer dereference in xfrm_bundle_create.
We may derference an invalid pointer in the error path of
xfrm_bundle_create(). Fix this by returning this error
pointer directly instead of assigning it to xdst0.
Fixes: 45b018beddb6 ("ipsec: Create and use new helpers for dst child access.")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Jason Wang [Tue, 29 May 2018 06:18:19 +0000 (14:18 +0800)]
vhost_net: flush batched heads before trying to busy polling
After commit
e2b3b35eb989 ("vhost_net: batch used ring update in rx"),
we tend to batch updating used heads. But it doesn't flush batched
heads before trying to do busy polling, this will cause vhost to wait
for guest TX which waits for the used RX. Fixing by flush batched
heads before busy loop.
1 byte TCP_RR performance recovers from 13107.83 to 50402.65.
Fixes: e2b3b35eb989 ("vhost_net: batch used ring update in rx")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiaki Makita [Mon, 28 May 2018 10:37:49 +0000 (19:37 +0900)]
tun: Fix NULL pointer dereference in XDP redirect
Calling XDP redirection requires bh disabled. Softirq can call another
XDP function and redirection functions, then the percpu static variable
ri->map can be overwritten to NULL.
This is a generic XDP case called from tun.
[ 3535.736058] BUG: unable to handle kernel NULL pointer dereference at
0000000000000018
[ 3535.743974] PGD 0 P4D 0
[ 3535.746530] Oops: 0000 [#1] SMP PTI
[ 3535.750049] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel crypto_simd cryptd enclosure hpwdt hpilo glue_helper ipmi_si pcspkr wmi mei_me ioatdma mei ipmi_devintf shpchp dca ipmi_msghandler lpc_ich acpi_power_meter sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm smartpqi i40e crc32c_intel scsi_transport_sas tg3 i2c_core ptp pps_core
[ 3535.813456] CPU: 5 PID: 1630 Comm: vhost-1614 Not tainted 4.17.0-rc4 #2
[ 3535.820127] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017
[ 3535.828732] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
[ 3535.833740] RSP: 0018:
ffffb4bc47bf7c58 EFLAGS:
00010246
[ 3535.839009] RAX:
ffff9fdfcfea1c40 RBX:
0000000000000000 RCX:
ffff9fdf27fe3100
[ 3535.846205] RDX:
ffff9fdfca769200 RSI:
0000000000000000 RDI:
0000000000000000
[ 3535.853402] RBP:
ffffb4bc491d9000 R08:
00000000000045ad R09:
0000000000000ec0
[ 3535.860597] R10:
0000000000000001 R11:
ffff9fdf26c3ce4e R12:
ffff9fdf9e72c000
[ 3535.867794] R13:
0000000000000000 R14:
fffffffffffffff2 R15:
ffff9fdfc82cdd00
[ 3535.874990] FS:
0000000000000000(0000) GS:
ffff9fdfcfe80000(0000) knlGS:
0000000000000000
[ 3535.883152] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 3535.888948] CR2:
0000000000000018 CR3:
0000000bde724004 CR4:
00000000007626e0
[ 3535.896145] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 3535.903342] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[ 3535.910538] PKRU:
55555554
[ 3535.913267] Call Trace:
[ 3535.915736] xdp_do_generic_redirect+0x7a/0x310
[ 3535.920310] do_xdp_generic.part.117+0x285/0x370
[ 3535.924970] tun_get_user+0x5b9/0x1260 [tun]
[ 3535.929279] tun_sendmsg+0x52/0x70 [tun]
[ 3535.933237] handle_tx+0x2ad/0x5f0 [vhost_net]
[ 3535.937721] vhost_worker+0xa5/0x100 [vhost]
[ 3535.942030] kthread+0xf5/0x130
[ 3535.945198] ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
[ 3535.950031] ? kthread_bind+0x10/0x10
[ 3535.953727] ret_from_fork+0x35/0x40
[ 3535.957334] Code: 0e 74 15 83 f8 10 75 05 e9 49 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 29 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 49 a9 b3 ff 31 c0 c3
[ 3535.976387] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP:
ffffb4bc47bf7c58
[ 3535.982883] CR2:
0000000000000018
[ 3535.987096] ---[ end trace
383b299dd1430240 ]---
[ 3536.131325] Kernel panic - not syncing: Fatal exception
[ 3536.137484] Kernel Offset: 0x26a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 3536.281406] ---[ end Kernel panic - not syncing: Fatal exception ]---
And a kernel with generic case fixed still panics in tun driver XDP
redirect, because it disabled only preemption, but not bh.
[ 2055.128746] BUG: unable to handle kernel NULL pointer dereference at
0000000000000018
[ 2055.136662] PGD 0 P4D 0
[ 2055.139219] Oops: 0000 [#1] SMP PTI
[ 2055.142736] Modules linked in: vhost_net vhost tap tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc vfat fat ext4 mbcache jbd2 intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc ses aesni_intel ipmi_ssif crypto_simd enclosure cryptd hpwdt glue_helper ioatdma hpilo wmi dca pcspkr ipmi_si acpi_power_meter ipmi_devintf shpchp mei_me ipmi_msghandler mei lpc_ich sch_fq_codel ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i40e smartpqi tg3 scsi_transport_sas crc32c_intel i2c_core ptp pps_core
[ 2055.206142] CPU: 6 PID: 1693 Comm: vhost-1683 Tainted: G W 4.17.0-rc5-fix-tun+ #1
[ 2055.215011] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 11/14/2017
[ 2055.223617] RIP: 0010:__xdp_map_lookup_elem+0x5/0x30
[ 2055.228624] RSP: 0018:
ffff998b07607cc0 EFLAGS:
00010246
[ 2055.233892] RAX:
ffff8dbd8e235700 RBX:
ffff8dbd8ff21c40 RCX:
0000000000000004
[ 2055.241089] RDX:
ffff998b097a9000 RSI:
0000000000000000 RDI:
0000000000000000
[ 2055.248286] RBP:
0000000000000000 R08:
00000000000065a8 R09:
0000000000005d80
[ 2055.255483] R10:
0000000000000040 R11:
ffff8dbcf0100000 R12:
ffff998b097a9000
[ 2055.262681] R13:
ffff8dbd8c98c000 R14:
0000000000000000 R15:
ffff998b07607d78
[ 2055.269879] FS:
0000000000000000(0000) GS:
ffff8dbd8ff00000(0000) knlGS:
0000000000000000
[ 2055.278039] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 2055.283834] CR2:
0000000000000018 CR3:
0000000c0c8cc005 CR4:
00000000007626e0
[ 2055.291030] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 2055.298227] DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
[ 2055.305424] PKRU:
55555554
[ 2055.308153] Call Trace:
[ 2055.310624] xdp_do_redirect+0x7b/0x380
[ 2055.314499] tun_get_user+0x10fe/0x12a0 [tun]
[ 2055.318895] tun_sendmsg+0x52/0x70 [tun]
[ 2055.322852] handle_tx+0x2ad/0x5f0 [vhost_net]
[ 2055.327337] vhost_worker+0xa5/0x100 [vhost]
[ 2055.331646] kthread+0xf5/0x130
[ 2055.334813] ? vhost_dev_ioctl+0x3b0/0x3b0 [vhost]
[ 2055.339646] ? kthread_bind+0x10/0x10
[ 2055.343343] ret_from_fork+0x35/0x40
[ 2055.346950] Code: 0e 74 15 83 f8 10 75 05 e9 e9 aa b3 ff f3 c3 0f 1f 80 00 00 00 00 f3 c3 e9 c9 9d b3 ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 47 18 83 f8 0e 74 0d 83 f8 10 75 05 e9 e9 a9 b3 ff 31 c0 c3
[ 2055.366004] RIP: __xdp_map_lookup_elem+0x5/0x30 RSP:
ffff998b07607cc0
[ 2055.372500] CR2:
0000000000000018
[ 2055.375856] ---[ end trace
2a2dcc5e9e174268 ]---
[ 2055.523626] Kernel panic - not syncing: Fatal exception
[ 2055.529796] Kernel Offset: 0x2e000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 2055.677539] ---[ end Kernel panic - not syncing: Fatal exception ]---
v2:
- Removed preempt_disable/enable since local_bh_disable will prevent
preemption as well, feedback from Jason Wang.
Fixes: 761876c857cb ("tap: XDP support")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Mon, 28 May 2018 05:26:06 +0000 (01:26 -0400)]
be2net: Fix error detection logic for BE3
Check for 0xE00 (RECOVERABLE_ERR) along with ARMFW UE (0x0)
in be_detect_error() to know whether the error is valid error or not
Fixes: 673c96e5a ("be2net: Fix UE detection logic for BE3")
Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Josh Hill [Mon, 28 May 2018 00:10:41 +0000 (20:10 -0400)]
net: qmi_wwan: Add Netgear Aircard 779S
Add support for Netgear Aircard 779S
Signed-off-by: Josh Hill <josh@joshuajhill.com>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Sun, 27 May 2018 06:48:41 +0000 (09:48 +0300)]
mlxsw: spectrum: Forbid creation of VLAN 1 over port/LAG
VLAN 1 is internally used for untagged traffic. Prevent creation of
explicit netdevice for that VLAN, because that currently isn't supported
and leads to the NULL pointer dereference cited below.
Fix by preventing creation of VLAN devices with VID of 1 over mlxsw
devices or LAG devices that involve mlxsw devices.
[ 327.175816] ================================================================================
[ 327.184544] UBSAN: Undefined behaviour in drivers/net/ethernet/mellanox/mlxsw/spectrum_fid.c:200:12
[ 327.193667] member access within null pointer of type 'const struct mlxsw_sp_fid'
[ 327.201226] CPU: 0 PID: 8983 Comm: ip Not tainted 4.17.0-rc4-petrm_net_ip6gre_headroom-custom-140 #11
[ 327.210496] Hardware name: Mellanox Technologies Ltd. "MSN2410-CB2F"/"SA000874", BIOS 4.6.5 03/08/2016
[ 327.219872] Call Trace:
[ 327.222384] dump_stack+0xc3/0x12b
[ 327.234007] ubsan_epilogue+0x9/0x49
[ 327.237638] ubsan_type_mismatch_common+0x1f9/0x2d0
[ 327.255769] __ubsan_handle_type_mismatch+0x90/0xa7
[ 327.264716] mlxsw_sp_fid_type+0x35/0x50 [mlxsw_spectrum]
[ 327.270255] mlxsw_sp_port_vlan_router_leave+0x46/0xc0 [mlxsw_spectrum]
[ 327.277019] mlxsw_sp_inetaddr_port_vlan_event+0xe1/0x340 [mlxsw_spectrum]
[ 327.315031] mlxsw_sp_netdevice_vrf_event+0xa8/0x100 [mlxsw_spectrum]
[ 327.321626] mlxsw_sp_netdevice_event+0x276/0x430 [mlxsw_spectrum]
[ 327.367863] notifier_call_chain+0x4c/0x150
[ 327.372128] __netdev_upper_dev_link+0x1b3/0x260
[ 327.399450] vrf_add_slave+0xce/0x170 [vrf]
[ 327.403703] do_setlink+0x658/0x1d70
[ 327.508998] rtnl_newlink+0x908/0xf20
[ 327.559128] rtnetlink_rcv_msg+0x50c/0x720
[ 327.571720] netlink_rcv_skb+0x16a/0x1f0
[ 327.583450] netlink_unicast+0x2ca/0x3e0
[ 327.599305] netlink_sendmsg+0x3e2/0x7f0
[ 327.616655] sock_sendmsg+0x76/0xc0
[ 327.620207] ___sys_sendmsg+0x494/0x5d0
[ 327.666117] __sys_sendmsg+0xc2/0x130
[ 327.690953] do_syscall_64+0x66/0x370
[ 327.694677] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 327.699782] RIP: 0033:0x7f4c2f3f8037
[ 327.703393] RSP: 002b:
00007ffe8c389708 EFLAGS:
00000246 ORIG_RAX:
000000000000002e
[ 327.711035] RAX:
ffffffffffffffda RBX:
000000005b03f53e RCX:
00007f4c2f3f8037
[ 327.718229] RDX:
0000000000000000 RSI:
00007ffe8c389760 RDI:
0000000000000003
[ 327.725431] RBP:
00007ffe8c389760 R08:
0000000000000000 R09:
00007f4c2f443630
[ 327.732632] R10:
00000000000005eb R11:
0000000000000246 R12:
0000000000000000
[ 327.739833] R13:
00000000006774e0 R14:
00007ffe8c3897e8 R15:
0000000000000000
[ 327.747096] ================================================================================
Fixes: 9589a7b5d7d9 ("mlxsw: spectrum: Handle VLAN devices linking / unlinking")
Suggested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Bornyakov [Fri, 25 May 2018 17:49:52 +0000 (20:49 +0300)]
atm: zatm: fix memcmp casting
memcmp() returns int, but eprom_try_esi() cast it to unsigned char. One
can lose significant bits and get 0 from non-0 value returned by the
memcmp().
Signed-off-by: Ivan Bornyakov <brnkv.i1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hao Wei Tee [Tue, 29 May 2018 07:25:17 +0000 (10:25 +0300)]
iwlwifi: pcie: compare with number of IRQs requested for, not number of CPUs
When there are 16 or more logical CPUs, we request for
`IWL_MAX_RX_HW_QUEUES` (16) IRQs only as we limit to that number of
IRQs, but later on we compare the number of IRQs returned to
nr_online_cpus+2 instead of max_irqs, the latter being what we
actually asked for. This ends up setting num_rx_queues to 17 which
causes lots of out-of-bounds array accesses later on.
Compare to max_irqs instead, and also add an assertion in case
num_rx_queues > IWM_MAX_RX_HW_QUEUES.
This fixes https://bugzilla.kernel.org/show_bug.cgi?id=199551
Fixes: 2e5d4a8f61dc ("iwlwifi: pcie: Add new configuration to enable MSIX")
Signed-off-by: Hao Wei Tee <angelsl@in04.sg>
Tested-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Stanislaw Gruszka [Mon, 28 May 2018 11:25:06 +0000 (13:25 +0200)]
Revert "rt2800: use TXOP_BACKOFF for probe frames"
This reverts commit
fb47ada8dc3c30c8e7b415da155742b49536c61e.
In some situations when we set TXOP_BACKOFF, the probe frame is
not sent at all. What it worse then sending probe frame as part
of AMPDU and can degrade 11n performance to 11g rates.
Cc: stable@vger.kernel.org
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Ard Biesheuvel [Fri, 25 May 2018 12:50:37 +0000 (14:50 +0200)]
net: netsec: reduce DMA mask to 40 bits
The netsec network controller IP can drive 64 address bits for DMA, and
the DMA mask is set accordingly in the driver. However, the SynQuacer
SoC, which is the only silicon incorporating this IP at the moment,
integrates this IP in a manner that leaves address bits [63:40]
unconnected.
Up until now, this has not resulted in any problems, given that the DDR
controller doesn't decode those bits to begin with. However, recent
firmware updates for platforms incorporating this SoC allow the IOMMU
to be enabled, which does decode address bits [47:40], and allocates
top down from the IOVA space, producing DMA addresses that have bits
set that have been left unconnected.
Both the DT and ACPI (IORT) descriptions of the platform take this into
account, and only describe a DMA address space of 40 bits (using either
dma-ranges DT properties, or DMA address limits in IORT named component
nodes). However, even though our IOMMU and bus layers may take such
limitations into account by setting a narrower DMA mask when creating
the platform device, the netsec probe() entrypoint follows the common
practice of setting the DMA mask uncondionally, according to the
capabilities of the IP block itself rather than to its integration into
the chip.
It is currently unclear what the correct fix is here. We could hack around
it by only setting the DMA mask if it deviates from its default value of
DMA_BIT_MASK(32). However, this makes it impossible for the bus layer to
use DMA_BIT_MASK(32) as the bus limit, and so it appears that a more
comprehensive approach is required to take DMA limits imposed by the
SoC as a whole into account.
In the mean time, let's limit the DMA mask to 40 bits. Given that there
is currently only one SoC that incorporates this IP, this is a reasonable
approach that can be backported to -stable and buys us some time to come
up with a proper fix going forward.
Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Jassi Brar <jaswinder.singh@linaro.org>
Cc: Masahisa Kojima <masahisa.kojima@linaro.org>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Jassi Brar <jaswinder.singh@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mathieu Xhonneux [Fri, 25 May 2018 12:29:41 +0000 (13:29 +0100)]
ipv6: sr: fix memory OOB access in seg6_do_srh_encap/inline
seg6_do_srh_encap and seg6_do_srh_inline can possibly do an
out-of-bounds access when adding the SRH to the packet. This no longer
happen when expanding the skb not only by the size of the SRH (+
outer IPv6 header), but also by skb->mac_len.
[ 53.793056] BUG: KASAN: use-after-free in seg6_do_srh_encap+0x284/0x620
[ 53.794564] Write of size 14 at addr
ffff88011975ecfa by task ping/674
[ 53.796665] CPU: 0 PID: 674 Comm: ping Not tainted 4.17.0-rc3-ARCH+ #90
[ 53.796670] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.11.0-20171110_100015-anatol 04/01/2014
[ 53.796673] Call Trace:
[ 53.796679] <IRQ>
[ 53.796689] dump_stack+0x71/0xab
[ 53.796700] print_address_description+0x6a/0x270
[ 53.796707] kasan_report+0x258/0x380
[ 53.796715] ? seg6_do_srh_encap+0x284/0x620
[ 53.796722] memmove+0x34/0x50
[ 53.796730] seg6_do_srh_encap+0x284/0x620
[ 53.796741] ? seg6_do_srh+0x29b/0x360
[ 53.796747] seg6_do_srh+0x29b/0x360
[ 53.796756] seg6_input+0x2e/0x2e0
[ 53.796765] lwtunnel_input+0x93/0xd0
[ 53.796774] ipv6_rcv+0x690/0x920
[ 53.796783] ? ip6_input+0x170/0x170
[ 53.796791] ? eth_gro_receive+0x2d0/0x2d0
[ 53.796800] ? ip6_input+0x170/0x170
[ 53.796809] __netif_receive_skb_core+0xcc0/0x13f0
[ 53.796820] ? netdev_info+0x110/0x110
[ 53.796827] ? napi_complete_done+0xb6/0x170
[ 53.796834] ? e1000_clean+0x6da/0xf70
[ 53.796845] ? process_backlog+0x129/0x2a0
[ 53.796853] process_backlog+0x129/0x2a0
[ 53.796862] net_rx_action+0x211/0x5c0
[ 53.796870] ? napi_complete_done+0x170/0x170
[ 53.796887] ? run_rebalance_domains+0x11f/0x150
[ 53.796891] __do_softirq+0x10e/0x39e
[ 53.796894] do_softirq_own_stack+0x2a/0x40
[ 53.796895] </IRQ>
[ 53.796898] do_softirq.part.16+0x54/0x60
[ 53.796900] __local_bh_enable_ip+0x5b/0x60
[ 53.796903] ip6_finish_output2+0x416/0x9f0
[ 53.796906] ? ip6_dst_lookup_flow+0x110/0x110
[ 53.796909] ? ip6_sk_dst_lookup_flow+0x390/0x390
[ 53.796911] ? __rcu_read_unlock+0x66/0x80
[ 53.796913] ? ip6_mtu+0x44/0xf0
[ 53.796916] ? ip6_output+0xfc/0x220
[ 53.796918] ip6_output+0xfc/0x220
[ 53.796921] ? ip6_finish_output+0x2b0/0x2b0
[ 53.796923] ? memcpy+0x34/0x50
[ 53.796926] ip6_send_skb+0x43/0xc0
[ 53.796929] rawv6_sendmsg+0x1216/0x1530
[ 53.796932] ? __orc_find+0x6b/0xc0
[ 53.796934] ? rawv6_rcv_skb+0x160/0x160
[ 53.796937] ? __rcu_read_unlock+0x66/0x80
[ 53.796939] ? __rcu_read_unlock+0x66/0x80
[ 53.796942] ? is_bpf_text_address+0x1e/0x30
[ 53.796944] ? kernel_text_address+0xec/0x100
[ 53.796946] ? __kernel_text_address+0xe/0x30
[ 53.796948] ? unwind_get_return_address+0x2f/0x50
[ 53.796950] ? __save_stack_trace+0x92/0x100
[ 53.796954] ? save_stack+0x89/0xb0
[ 53.796956] ? kasan_kmalloc+0xa0/0xd0
[ 53.796958] ? kmem_cache_alloc+0xd2/0x1f0
[ 53.796961] ? prepare_creds+0x23/0x160
[ 53.796963] ? __x64_sys_capset+0x252/0x3e0
[ 53.796966] ? do_syscall_64+0x69/0x160
[ 53.796968] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 53.796971] ? __alloc_pages_nodemask+0x170/0x380
[ 53.796973] ? __alloc_pages_slowpath+0x12c0/0x12c0
[ 53.796977] ? tty_vhangup+0x20/0x20
[ 53.796979] ? policy_nodemask+0x1a/0x90
[ 53.796982] ? __mod_node_page_state+0x8d/0xa0
[ 53.796986] ? __check_object_size+0xe7/0x240
[ 53.796989] ? __sys_sendto+0x229/0x290
[ 53.796991] ? rawv6_rcv_skb+0x160/0x160
[ 53.796993] __sys_sendto+0x229/0x290
[ 53.796996] ? __ia32_sys_getpeername+0x50/0x50
[ 53.796999] ? commit_creds+0x2de/0x520
[ 53.797002] ? security_capset+0x57/0x70
[ 53.797004] ? __x64_sys_capset+0x29f/0x3e0
[ 53.797007] ? __x64_sys_rt_sigsuspend+0xe0/0xe0
[ 53.797011] ? __do_page_fault+0x664/0x770
[ 53.797014] __x64_sys_sendto+0x74/0x90
[ 53.797017] do_syscall_64+0x69/0x160
[ 53.797019] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 53.797022] RIP: 0033:0x7f43b7a6714a
[ 53.797023] RSP: 002b:
00007ffd891bd368 EFLAGS:
00000246 ORIG_RAX:
000000000000002c
[ 53.797026] RAX:
ffffffffffffffda RBX:
00000000006129c0 RCX:
00007f43b7a6714a
[ 53.797028] RDX:
0000000000000040 RSI:
00000000006129c0 RDI:
0000000000000004
[ 53.797029] RBP:
00007ffd891be640 R08:
0000000000610940 R09:
000000000000001c
[ 53.797030] R10:
0000000000000000 R11:
0000000000000246 R12:
0000000000000040
[ 53.797032] R13:
000000000060e6a0 R14:
0000000000008004 R15:
000000000040b661
[ 53.797171] Allocated by task 642:
[ 53.797460] kasan_kmalloc+0xa0/0xd0
[ 53.797463] kmem_cache_alloc+0xd2/0x1f0
[ 53.797465] getname_flags+0x40/0x210
[ 53.797467] user_path_at_empty+0x1d/0x40
[ 53.797469] do_faccessat+0x12a/0x320
[ 53.797471] do_syscall_64+0x69/0x160
[ 53.797473] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 53.797607] Freed by task 642:
[ 53.797869] __kasan_slab_free+0x130/0x180
[ 53.797871] kmem_cache_free+0xa8/0x230
[ 53.797872] filename_lookup+0x15b/0x230
[ 53.797874] do_faccessat+0x12a/0x320
[ 53.797876] do_syscall_64+0x69/0x160
[ 53.797878] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 53.798014] The buggy address belongs to the object at
ffff88011975e600
which belongs to the cache names_cache of size 4096
[ 53.799043] The buggy address is located 1786 bytes inside of
4096-byte region [
ffff88011975e600,
ffff88011975f600)
[ 53.800013] The buggy address belongs to the page:
[ 53.800414] page:
ffffea000465d600 count:1 mapcount:0
mapping:
0000000000000000 index:0x0 compound_mapcount: 0
[ 53.801259] flags: 0x17fff0000008100(slab|head)
[ 53.801640] raw:
017fff0000008100 0000000000000000 0000000000000000
0000000100070007
[ 53.803147] raw:
dead000000000100 dead000000000200 ffff88011b185a40
0000000000000000
[ 53.803787] page dumped because: kasan: bad access detected
[ 53.804384] Memory state around the buggy address:
[ 53.804788]
ffff88011975eb80: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[ 53.805384]
ffff88011975ec00: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[ 53.805979] >
ffff88011975ec80: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[ 53.806577] ^
[ 53.807165]
ffff88011975ed00: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[ 53.807762]
ffff88011975ed80: fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb fb
[ 53.808356] ==================================================================
[ 53.808949] Disabling lock debugging due to kernel taint
Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Signed-off-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 29 May 2018 02:39:09 +0000 (22:39 -0400)]
Merge git://git./pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree:
1) Null pointer dereference when dumping conntrack helper configuration,
from Taehee Yoo.
2) Missing sanitization in ebtables extension name through compat,
from Paolo Abeni.
3) Broken fetch of tracing value, from Taehee Yoo.
4) Incorrect arithmetics in packet ratelimiting.
5) Buffer overflow in IPVS sync daemon, from Julian Anastasov.
6) Wrong argument to nla_strlcpy() in nfnetlink_{acct,cthelper},
from Eric Dumazet.
7) Fix splat in nft_update_chain_stats().
8) Null pointer dereference from object netlink dump path, from
Taehee Yoo.
9) Missing static_branch_inc() when enabling counters in existing
chain, from Taehee Yoo.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Taehee Yoo [Mon, 28 May 2018 16:14:12 +0000 (01:14 +0900)]
netfilter: nf_tables: increase nft_counters_enabled in nft_chain_stats_replace()
When a chain is updated, a counter can be attached. if so,
the nft_counters_enabled should be increased.
test commands:
%nft add table ip filter
%nft add chain ip filter input { type filter hook input priority 4\; }
%iptables-compat -Z input
%nft delete chain ip filter input
we can see below messages.
[ 286.443720] jump label: negative count!
[ 286.448278] WARNING: CPU: 0 PID: 1459 at kernel/jump_label.c:197 __static_key_slow_dec_cpuslocked+0x6f/0xf0
[ 286.449144] Modules linked in: nf_tables nfnetlink ip_tables x_tables
[ 286.449144] CPU: 0 PID: 1459 Comm: nft Tainted: G W 4.17.0-rc2+ #12
[ 286.449144] RIP: 0010:__static_key_slow_dec_cpuslocked+0x6f/0xf0
[ 286.449144] RSP: 0018:
ffff88010e5176f0 EFLAGS:
00010286
[ 286.449144] RAX:
000000000000001b RBX:
ffffffffc0179500 RCX:
ffffffffb8a82522
[ 286.449144] RDX:
0000000000000001 RSI:
0000000000000008 RDI:
ffff88011b7e5eac
[ 286.449144] RBP:
0000000000000000 R08:
ffffed00236fce5c R09:
ffffed00236fce5b
[ 286.449144] R10:
ffffffffc0179503 R11:
ffffed00236fce5c R12:
0000000000000000
[ 286.449144] R13:
ffff88011a28e448 R14:
ffff88011a28e470 R15:
dffffc0000000000
[ 286.449144] FS:
00007f0384328700(0000) GS:
ffff88011b600000(0000) knlGS:
0000000000000000
[ 286.449144] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 286.449144] CR2:
00007f038394bf10 CR3:
0000000104a86000 CR4:
00000000001006f0
[ 286.449144] Call Trace:
[ 286.449144] static_key_slow_dec+0x6a/0x70
[ 286.449144] nf_tables_chain_destroy+0x19d/0x210 [nf_tables]
[ 286.449144] nf_tables_commit+0x1891/0x1c50 [nf_tables]
[ 286.449144] nfnetlink_rcv+0x1148/0x13d0 [nfnetlink]
[ ... ]
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Taehee Yoo [Mon, 28 May 2018 16:13:45 +0000 (01:13 +0900)]
netfilter: nf_tables: fix NULL-ptr in nf_tables_dump_obj()
The table field in nft_obj_filter is not an array. In order to check
tablename, we should check if the pointer is set.
Test commands:
%nft add table ip filter
%nft add counter ip filter ct1
%nft reset counters
Splat looks like:
[ 306.510504] kasan: CONFIG_KASAN_INLINE enabled
[ 306.516184] kasan: GPF could be caused by NULL-ptr deref or user memory access
[ 306.524775] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 306.528284] Modules linked in: nft_objref nft_counter nf_tables nfnetlink ip_tables x_tables
[ 306.528284] CPU: 0 PID: 1488 Comm: nft Not tainted 4.17.0-rc4+ #17
[ 306.528284] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[ 306.528284] RIP: 0010:nf_tables_dump_obj+0x52c/0xa70 [nf_tables]
[ 306.528284] RSP: 0018:
ffff8800b6cb7520 EFLAGS:
00010246
[ 306.528284] RAX:
0000000000000000 RBX:
ffff8800b6c49820 RCX:
0000000000000000
[ 306.528284] RDX:
0000000000000000 RSI:
dffffc0000000000 RDI:
ffffed0016d96e9a
[ 306.528284] RBP:
ffff8800b6cb75c0 R08:
ffffed00236fce7c R09:
ffffed00236fce7b
[ 306.528284] R10:
ffffffff9f6241e8 R11:
ffffed00236fce7c R12:
ffff880111365108
[ 306.528284] R13:
0000000000000000 R14:
ffff8800b6c49860 R15:
ffff8800b6c49860
[ 306.528284] FS:
00007f838b007700(0000) GS:
ffff88011b600000(0000) knlGS:
0000000000000000
[ 306.528284] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 306.528284] CR2:
00007ffeafabcf78 CR3:
00000000b6cbe000 CR4:
00000000001006f0
[ 306.528284] Call Trace:
[ 306.528284] netlink_dump+0x470/0xa20
[ 306.528284] __netlink_dump_start+0x5ae/0x690
[ 306.528284] ? nf_tables_getobj+0x1b3/0x740 [nf_tables]
[ 306.528284] nf_tables_getobj+0x2f5/0x740 [nf_tables]
[ 306.528284] ? nft_obj_notify+0x100/0x100 [nf_tables]
[ 306.528284] ? nf_tables_getobj+0x740/0x740 [nf_tables]
[ 306.528284] ? nf_tables_dump_flowtable_done+0x70/0x70 [nf_tables]
[ 306.528284] ? nft_obj_notify+0x100/0x100 [nf_tables]
[ 306.528284] nfnetlink_rcv_msg+0x8ff/0x932 [nfnetlink]
[ 306.528284] ? nfnetlink_rcv_msg+0x216/0x932 [nfnetlink]
[ 306.528284] netlink_rcv_skb+0x1c9/0x2f0
[ 306.528284] ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
[ 306.528284] ? debug_check_no_locks_freed+0x270/0x270
[ 306.528284] ? netlink_ack+0x7a0/0x7a0
[ 306.528284] ? ns_capable_common+0x6e/0x110
[ ... ]
Fixes: e46abbcc05aa8 ("netfilter: nf_tables: Allow table names of up to 255 chars")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso [Sun, 27 May 2018 19:08:13 +0000 (21:08 +0200)]
netfilter: nf_tables: disable preemption in nft_update_chain_stats()
This patch fixes the following splat.
[118709.054937] BUG: using smp_processor_id() in preemptible [
00000000] code: test/1571
[118709.054970] caller is nft_update_chain_stats.isra.4+0x53/0x97 [nf_tables]
[118709.054980] CPU: 2 PID: 1571 Comm: test Not tainted 4.17.0-rc6+ #335
[...]
[118709.054992] Call Trace:
[118709.055011] dump_stack+0x5f/0x86
[118709.055026] check_preemption_disabled+0xd4/0xe4
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Linus Torvalds [Sat, 26 May 2018 03:24:28 +0000 (20:24 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"16 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
kasan: fix memory hotplug during boot
kasan: free allocated shadow memory on MEM_CANCEL_ONLINE
checkpatch: fix macro argument precedence test
init/main.c: include <linux/mem_encrypt.h>
kernel/sys.c: fix potential Spectre v1 issue
mm/memory_hotplug: fix leftover use of struct page during hotplug
proc: fix smaps and meminfo alignment
mm: do not warn on offline nodes unless the specific node is explicitly requested
mm, memory_hotplug: make has_unmovable_pages more robust
mm/kasan: don't vfree() nonexistent vm_area
MAINTAINERS: change hugetlbfs maintainer and update files
ipc/shm: fix shmat() nil address after round-down when remapping
Revert "ipc/shm: Fix shmat mmap nil-page protection"
idr: fix invalid ptr dereference on item delete
ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio"
mm: fix nr_rotate_swap leak in swapon() error case
Linus Torvalds [Sat, 26 May 2018 02:54:42 +0000 (19:54 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
"Let's begin the holiday weekend with some networking fixes:
1) Whoops need to restrict cfg80211 wiphy names even more to 64
bytes. From Eric Biggers.
2) Fix flags being ignored when using kernel_connect() with SCTP,
from Xin Long.
3) Use after free in DCCP, from Alexey Kodanev.
4) Need to check rhltable_init() return value in ipmr code, from Eric
Dumazet.
5) XDP handling fixes in virtio_net from Jason Wang.
6) Missing RTA_TABLE in rtm_ipv4_policy[], from Roopa Prabhu.
7) Need to use IRQ disabling spinlocks in mlx4_qp_lookup(), from Jack
Morgenstein.
8) Prevent out-of-bounds speculation using indexes in BPF, from
Daniel Borkmann.
9) Fix regression added by AF_PACKET link layer cure, from Willem de
Bruijn.
10) Correct ENIC dma mask, from Govindarajulu Varadarajan.
11) Missing config options for PMTU tests, from Stefano Brivio"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
ibmvnic: Fix partial success login retries
selftests/net: Add missing config options for PMTU tests
mlx4_core: allocate ICM memory in page size chunks
enic: set DMA mask to 47 bit
ppp: remove the PPPIOCDETACH ioctl
ipv4: remove warning in ip_recv_error
net : sched: cls_api: deal with egdev path only if needed
vhost: synchronize IOTLB message with dev cleanup
packet: fix reserve calculation
net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands
net/mlx5e: When RXFCS is set, add FCS data into checksum calculation
bpf: properly enforce index mask to prevent out-of-bounds speculation
net/mlx4: Fix irq-unsafe spinlock usage
net: phy: broadcom: Fix bcm_write_exp()
net: phy: broadcom: Fix auxiliary control register reads
net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy
net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message
ibmvnic: Only do H_EOI for mobility events
tuntap: correctly set SOCKWQ_ASYNC_NOSPACE
virtio-net: fix leaking page for gso packet during mergeable XDP
...
David Hildenbrand [Fri, 25 May 2018 21:48:11 +0000 (14:48 -0700)]
kasan: fix memory hotplug during boot
Using module_init() is wrong. E.g. ACPI adds and onlines memory before
our memory notifier gets registered.
This makes sure that ACPI memory detected during boot up will not result
in a kernel crash.
Easily reproducible with QEMU, just specify a DIMM when starting up.
Link: http://lkml.kernel.org/r/20180522100756.18478-3-david@redhat.com
Fixes: 786a8959912e ("kasan: disable memory hotplug")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 25 May 2018 21:48:08 +0000 (14:48 -0700)]
kasan: free allocated shadow memory on MEM_CANCEL_ONLINE
We have to free memory again when we cancel onlining, otherwise a later
onlining attempt will fail.
Link: http://lkml.kernel.org/r/20180522100756.18478-2-david@redhat.com
Fixes: fa69b5989bb0 ("mm/kasan: add support for memory hotplug")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Fri, 25 May 2018 21:48:04 +0000 (14:48 -0700)]
checkpatch: fix macro argument precedence test
checkpatch's macro argument precedence test is broken so fix it.
Link: http://lkml.kernel.org/r/5dd900e9197febc1995604bb33c23c136d8b33ce.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mathieu Malaterre [Fri, 25 May 2018 21:48:00 +0000 (14:48 -0700)]
init/main.c: include <linux/mem_encrypt.h>
In commit
c7753208a94c ("x86, swiotlb: Add memory encryption support") a
call to function `mem_encrypt_init' was added. Include prototype
defined in header <linux/mem_encrypt.h> to prevent a warning reported
during compilation with W=1:
init/main.c:494:20: warning: no previous prototype for `mem_encrypt_init' [-Wmissing-prototypes]
Link: http://lkml.kernel.org/r/20180522195533.31415-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Gargi Sharma <gs051095@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gustavo A. R. Silva [Fri, 25 May 2018 21:47:57 +0000 (14:47 -0700)]
kernel/sys.c: fix potential Spectre v1 issue
`resource' can be controlled by user-space, hence leading to a potential
exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
kernel/sys.c:1474 __do_compat_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)
kernel/sys.c:1455 __do_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)
Fix this by sanitizing *resource* before using it to index
current->signal->rlim
Notice that given that speculation windows are large, the policy is to
kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://marc.info/?l=linux-kernel&m=
152449131114778&w=2
Link: http://lkml.kernel.org/r/20180515030038.GA11822@embeddedor.com
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jonathan Cameron [Fri, 25 May 2018 21:47:53 +0000 (14:47 -0700)]
mm/memory_hotplug: fix leftover use of struct page during hotplug
The case of a new numa node got missed in avoiding using the node info
from page_struct during hotplug. In this path we have a call to
register_mem_sect_under_node (which allows us to specify it is hotplug
so don't change the node), via link_mem_sections which unfortunately
does not.
Fix is to pass check_nid through link_mem_sections as well and disable
it in the new numa node path.
Note the bug only 'sometimes' manifests depending on what happens to be
in the struct page structures - there are lots of them and it only needs
to match one of them.
The result of the bug is that (with a new memory only node) we never
successfully call register_mem_sect_under_node so don't get the memory
associated with the node in sysfs and meminfo for the node doesn't
report it.
It came up whilst testing some arm64 hotplug patches, but appears to be
universal. Whilst I'm triggering it by removing then reinserting memory
to a node with no other elements (thus making the node disappear then
appear again), it appears it would happen on hotplugging memory where
there was none before and it doesn't seem to be related the arm64
patches.
These patches call __add_pages (where most of the issue was fixed by
Pavel's patch). If there is a node at the time of the __add_pages call
then all is well as it calls register_mem_sect_under_node from there
with check_nid set to false. Without a node that function returns
having not done the sysfs related stuff as there is no node to use.
This is expected but it is the resulting path that fails...
Exact path to the problem is as follows:
mm/memory_hotplug.c: add_memory_resource()
The node is not online so we enter the 'if (new_node)' twice, on the
second such block there is a call to link_mem_sections which calls
into
drivers/node.c: link_mem_sections() which calls
drivers/node.c: register_mem_sect_under_node() which calls
get_nid_for_pfn and keeps trying until the output of that matches
the expected node (passed all the way down from
add_memory_resource)
It is effectively the same fix as the one referred to in the fixes tag
just in the code path for a new node where the comments point out we
have to rerun the link creation because it will have failed in
register_new_memory (as there was no node at the time). (actually that
comment is wrong now as we don't have register_new_memory any more it
got renamed to hotplug_memory_register in Pavel's patch).
Link: http://lkml.kernel.org/r/20180504085311.1240-1-Jonathan.Cameron@huawei.com
Fixes: fc44f7f9231a ("mm/memory_hotplug: don't read nid from struct page during hotplug")
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Fri, 25 May 2018 21:47:50 +0000 (14:47 -0700)]
proc: fix smaps and meminfo alignment
The 4.17-rc /proc/meminfo and /proc/<pid>/smaps look ugly: single-digit
numbers (commonly 0) are misaligned.
Remove seq_put_decimal_ull_width()'s leftover optimization for single
digits: it's wrong now that num_to_str() takes care of the width.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805241554210.1326@eggly.anvils
Fixes: d1be35cb6f96 ("proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 25 May 2018 21:47:46 +0000 (14:47 -0700)]
mm: do not warn on offline nodes unless the specific node is explicitly requested
Oscar has noticed that we splat
WARNING: CPU: 0 PID: 64 at ./include/linux/gfp.h:467 vmemmap_alloc_block+0x4e/0xc9
[...]
CPU: 0 PID: 64 Comm: kworker/u4:1 Tainted: G W E 4.17.0-rc5-next-
20180517-1-default+ #66
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
Call Trace:
vmemmap_populate+0xf2/0x2ae
sparse_mem_map_populate+0x28/0x35
sparse_add_one_section+0x4c/0x187
__add_pages+0xe7/0x1a0
add_pages+0x16/0x70
add_memory_resource+0xa3/0x1d0
add_memory+0xe4/0x110
acpi_memory_device_add+0x134/0x2e0
acpi_bus_attach+0xd9/0x190
acpi_bus_scan+0x37/0x70
acpi_device_hotplug+0x389/0x4e0
acpi_hotplug_work_fn+0x1a/0x30
process_one_work+0x146/0x340
worker_thread+0x47/0x3e0
kthread+0xf5/0x130
ret_from_fork+0x35/0x40
when adding memory to a node that is currently offline.
The VM_WARN_ON is just too loud without a good reason. In this
particular case we are doing
alloc_pages_node(node, GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN, order)
so we do not insist on allocating from the given node (it is more a
hint) so we can fall back to any other populated node and moreover we
explicitly ask to not warn for the allocation failure.
Soften the warning only to cases when somebody asks for the given node
explicitly by __GFP_THISNODE.
Link: http://lkml.kernel.org/r/20180523125555.30039-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Oscar Salvador <osalvador@techadventures.net>
Tested-by: Oscar Salvador <osalvador@techadventures.net>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 25 May 2018 21:47:42 +0000 (14:47 -0700)]
mm, memory_hotplug: make has_unmovable_pages more robust
Oscar has reported:
: Due to an unfortunate setting with movablecore, memblocks containing bootmem
: memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
: So while trying to remove that memory, the system failed in do_migrate_range
: and __offline_pages never returned.
:
: This can be reproduced by running
: qemu-system-x86_64 -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M
: and movablecore=4G kernel command line
:
: linux kernel: BIOS-provided physical RAM map:
: linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
: linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
: linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
: linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
: linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable
: linux kernel: NX (Execute Disable) protection: active
: linux kernel: SMBIOS 2.8 present.
: linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org
: linux kernel: Hypervisor detected: KVM
: linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
: linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
: linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000
:
: linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0
: linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
: linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug
: linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0
: linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0
: linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff]
: linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff]
:
: zoneinfo shows that the zone movable is placed into both numa nodes:
: Node 0, zone Movable
: pages free 160140
: min 1823
: low 2278
: high 2733
: spanned 262144
: present 262144
: managed 245670
: Node 1, zone Movable
: pages free 448427
: min 3827
: low 4783
: high 5739
: spanned 524288
: present 524288
: managed 515766
Note how only Node 0 has a hutplugable memory region which would rule it
out from the early memblock allocations (most likely memmap). Node1
will surely contain memmaps on the same node and those would prevent
offlining to succeed. So this is arguably a configuration issue.
Although one could argue that we should be more clever and rule early
allocations from the zone movable. This would be correct but probably
not worth the effort considering what a hack movablecore is.
Anyway, We could do better for those cases though. We rely on
start_isolate_page_range resp. has_unmovable_pages to do their job.
The first one isolates the whole range to be offlined so that we do not
allocate from it anymore and the later makes sure we are not stumbling
over non-migrateable pages.
has_unmovable_pages is overly optimistic, however. It doesn't check all
the pages if we are withing zone_movable because we rely that those
pages will be always migrateable. As it turns out we are still not
perfect there. While bootmem pages in zonemovable sound like a clear
bug which should be fixed let's remove the optimization for now and warn
if we encounter unmovable pages in zone_movable in the meantime. That
should help for now at least.
Btw. this wasn't a real problem until commit
72b39cfc4d75 ("mm,
memory_hotplug: do not fail offlining too early") because we used to
have a small number of retries and then failed. This turned out to be
too fragile though.
Link: http://lkml.kernel.org/r/20180523125555.30039-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Oscar Salvador <osalvador@techadventures.net>
Tested-by: Oscar Salvador <osalvador@techadventures.net>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Fri, 25 May 2018 21:47:38 +0000 (14:47 -0700)]
mm/kasan: don't vfree() nonexistent vm_area
KASAN uses different routines to map shadow for hot added memory and
memory obtained in boot process. Attempt to offline memory onlined by
normal boot process leads to this:
Trying to vfree() nonexistent vm area (
000000005d3b34b9)
WARNING: CPU: 2 PID: 13215 at mm/vmalloc.c:1525 __vunmap+0x147/0x190
Call Trace:
kasan_mem_notifier+0xad/0xb9
notifier_call_chain+0x166/0x260
__blocking_notifier_call_chain+0xdb/0x140
__offline_pages+0x96a/0xb10
memory_subsys_offline+0x76/0xc0
device_offline+0xb8/0x120
store_mem_state+0xfa/0x120
kernfs_fop_write+0x1d5/0x320
__vfs_write+0xd4/0x530
vfs_write+0x105/0x340
SyS_write+0xb0/0x140
Obviously we can't call vfree() to free memory that wasn't allocated via
vmalloc(). Use find_vm_area() to see if we can call vfree().
Unfortunately it's a bit tricky to properly unmap and free shadow
allocated during boot, so we'll have to keep it. If memory will come
online again that shadow will be reused.
Matthew asked: how can you call vfree() on something that isn't a
vmalloc address?
vfree() is able to free any address returned by
__vmalloc_node_range(). And __vmalloc_node_range() gives you any
address you ask. It doesn't have to be an address in [VMALLOC_START,
VMALLOC_END] range.
That's also how the module_alloc()/module_memfree() works on
architectures that have designated area for modules.
[aryabinin@virtuozzo.com: improve comments]
Link: http://lkml.kernel.org/r/dabee6ab-3a7a-51cd-3b86-5468718e0390@virtuozzo.com
[akpm@linux-foundation.org: fix typos, reflow comment]
Link: http://lkml.kernel.org/r/20180201163349.8700-1-aryabinin@virtuozzo.com
Fixes: fa69b5989bb0 ("mm/kasan: add support for memory hotplug")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Paul Menzel <pmenzel+linux-kasan-dev@molgen.mpg.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Fri, 25 May 2018 21:47:35 +0000 (14:47 -0700)]
MAINTAINERS: change hugetlbfs maintainer and update files
The current hugetlbfs maintainer has not been active for more than a few
years. I have been been active in this area for more than two years and
plan to remain active in the foreseeable future.
Also, update the hugetlbfs entry to include linux-mm mail list and
additional hugetlbfs related files. hugetlb.c and hugetlb.h are not
100% hugetlbfs, but a majority of their content is hugetlbfs related.
Link: http://lkml.kernel.org/r/20180518225236.19079-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Fri, 25 May 2018 21:47:30 +0000 (14:47 -0700)]
ipc/shm: fix shmat() nil address after round-down when remapping
shmat()'s SHM_REMAP option forbids passing a nil address for; this is in
fact the very first thing we check for. Andrea reported that for
SHM_RND|SHM_REMAP cases we can end up bypassing the initial addr check,
but we need to check again if the address was rounded down to nil. As
of this patch, such cases will return -EINVAL.
Link: http://lkml.kernel.org/r/20180503204934.kk63josdu6u53fbd@linux-n805
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Fri, 25 May 2018 21:47:27 +0000 (14:47 -0700)]
Revert "ipc/shm: Fix shmat mmap nil-page protection"
Patch series "ipc/shm: shmat() fixes around nil-page".
These patches fix two issues reported[1] a while back by Joe and Andrea
around how shmat(2) behaves with nil-page.
The first reverts a commit that it was incorrectly thought that mapping
nil-page (address=0) was a no no with MAP_FIXED. This is not the case,
with the exception of SHM_REMAP; which is address in the second patch.
I chose two patches because it is easier to backport and it explicitly
reverts bogus behaviour. Both patches ought to be in -stable and ltp
testcases need updated (the added testcase around the cve can be
modified to just test for SHM_RND|SHM_REMAP).
[1] lkml.kernel.org/r/
20180430172152.nfa564pvgpk3ut7p@linux-n805
This patch (of 2):
Commit
95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection")
worked on the idea that we should not be mapping as root addr=0 and
MAP_FIXED. However, it was reported that this scenario is in fact
valid, thus making the patch both bogus and breaks userspace as well.
For example X11's libint10.so relies on shmat(1, SHM_RND) for lowmem
initialization[1].
[1] https://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/os-support/linux/int10/linux.c#n347
Link: http://lkml.kernel.org/r/20180503203243.15045-2-dave@stgolabs.net
Fixes: 95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection")
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Joe Lawrence <joe.lawrence@redhat.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Fri, 25 May 2018 21:47:24 +0000 (14:47 -0700)]
idr: fix invalid ptr dereference on item delete
If the radix tree underlying the IDR happens to be full and we attempt
to remove an id which is larger than any id in the IDR, we will call
__radix_tree_delete() with an uninitialised 'slot' pointer, at which
point anything could happen. This was easiest to hit with a single
entry at id 0 and attempting to remove a non-0 id, but it could have
happened with 64 entries and attempting to remove an id >= 64.
Roman said:
The syzcaller test boils down to opening /dev/kvm, creating an
eventfd, and calling a couple of KVM ioctls. None of this requires
superuser. And the result is dereferencing an uninitialized pointer
which is likely a crash. The specific path caught by syzbot is via
KVM_HYPERV_EVENTD ioctl which is new in 4.17. But I guess there are
other user-triggerable paths, so cc:stable is probably justified.
Matthew added:
We have around 250 calls to idr_remove() in the kernel today. Many of
them pass an ID which is embedded in the object they're removing, so
they're safe. Picking a few likely candidates:
drivers/firewire/core-cdev.c looks unsafe; the ID comes from an ioctl.
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c is similar
drivers/atm/nicstar.c could be taken down by a handcrafted packet
Link: http://lkml.kernel.org/r/20180518175025.GD6361@bombadil.infradead.org
Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
Reported-by: <syzbot+35666cba7f0a337e2e79@syzkaller.appspotmail.com>
Debugged-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changwei Ge [Fri, 25 May 2018 21:47:20 +0000 (14:47 -0700)]
ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio"
This reverts commit
ba16ddfbeb9d ("ocfs2/o2hb: check len for
bio_add_page() to avoid getting incorrect bio").
In my testing, this patch introduces a problem that mkfs can't have
slots more than 16 with 4k block size.
And the original logic is safe actually with the situation it mentions
so revert this commit.
Attach test log:
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, vec_start = 0
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio failed, page
ffffea0002d7ed40, len 0, vec_len 4096, vec_start 0,bi_sector 8192
(mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
(mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5
(mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5
Link: http://lkml.kernel.org/r/SIXPR06MB0461721F398A5A92FC68C39ED5920@SIXPR06MB0461.apcprd06.prod.outlook.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Omar Sandoval [Fri, 25 May 2018 21:47:17 +0000 (14:47 -0700)]
mm: fix nr_rotate_swap leak in swapon() error case
If swapon() fails after incrementing nr_rotate_swap, we don't decrement
it and thus effectively leak it. Make sure we decrement it if we
incremented it.
Link: http://lkml.kernel.org/r/b6fe6b879f17fa68eee6cbd876f459f6e5e33495.1526491581.git.osandov@fb.com
Fixes: 81a0298bdfab ("mm, swap: don't use VMA based swap readahead if HDD is used as swap")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Falcon [Thu, 24 May 2018 19:37:53 +0000 (14:37 -0500)]
ibmvnic: Fix partial success login retries
In its current state, the driver will handle backing device
login in a loop for a certain number of retries while the
device returns a partial success, indicating that the driver
may need to try again using a smaller number of resources.
The variable it checks to continue retrying may change
over the course of operations, resulting in reallocation
of resources but exits without sending the login attempt.
Guard against this by introducing a boolean variable that
will retain the state indicating that the driver needs to
reattempt login with backing device firmware.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 25 May 2018 19:37:41 +0000 (15:37 -0400)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2018-05-24
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a bug in the original fix to prevent out of bounds speculation when
multiple tail call maps from different branches or calls end up at the
same tail call helper invocation, from Daniel.
2) Two selftest fixes, one in reuseport_bpf_numa where test is skipped in
case of missing numa support and another one to update kernel config to
properly support xdp_meta.sh test, from Anders.
...
Would be great if you have a chance to merge net into net-next after that.
The verifier fix would be needed later as a dependency in bpf-next for
upcomig work there. When you do the merge there's a trivial conflict on
BPF side with
849fa50662fb ("bpf/verifier: refine retval R0 state for
bpf_get_stack helper"): Resolution is to keep both functions, the
do_refine_retval_range() and record_func_map().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefano Brivio [Thu, 24 May 2018 14:10:12 +0000 (16:10 +0200)]
selftests/net: Add missing config options for PMTU tests
PMTU tests in pmtu.sh need support for VTI, VTI6 and dummy
interfaces: add them to config file.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: d1f1b9cbf34c ("selftests: net: Introduce first PMTU test")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 25 May 2018 18:54:19 +0000 (14:54 -0400)]
Merge tag 'batadv-net-for-davem-
20180524' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are some batman-adv bugfixes:
- prevent hardif_put call with NULL parameter, by Colin Ian King
- Avoid race in Translation Table allocator, by Sven Eckelmann
- Fix Translation Table sync flags for intermediate Responses,
by Linus Luessing
- prevent sending inconsistent Translation Table TVLVs,
by Marek Lindner
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 25 May 2018 16:35:11 +0000 (09:35 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull more arm64 fixes from Will Deacon:
- fix application of read-only permissions to kernel section mappings
- sanitise reported ESR values for signals delivered on a kernel
address
- ensure tishift GCC helpers are exported to modules
- fix inline asm constraints for some LSE atomics
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Make sure permission updates happen for pmd/pud
arm64: fault: Don't leak data in ESR context for user fault on kernel VA
arm64: export tishift functions to modules
arm64: lse: Add early clobbers to some input/output asm operands
Linus Torvalds [Fri, 25 May 2018 16:32:00 +0000 (09:32 -0700)]
Merge tag 'powerpc-4.17-7' of git://git./linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"Just one fix, to make sure the PCR (Processor Compatibility Register)
is reset on boot.
Otherwise if we're running in compat mode in a guest (eg. pretending a
Power9 is a Power8) and the host kernel oopses and kdumps then the
kdump kernel's userspace will be running in Power8 mode, and will
SIGILL if it uses Power9-only instructions.
Thanks to Michael Neuling"
* tag 'powerpc-4.17-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Clear PCR on boot
Linus Torvalds [Fri, 25 May 2018 16:29:17 +0000 (09:29 -0700)]
Merge tag 'mmc-v4.17-rc3' of git://git./linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Propagate correct error code for RPMB requests
MMC host:
- sdhci-iproc: Drop hard coded cap for 1.8v
- sdhci-iproc: Fix 32bit writes for transfer mode
- sdhci-iproc: Enable SDHCI_QUIRK2_HOST_OFF_CARD_ON for cygnus"
* tag 'mmc-v4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci-iproc: add SDHCI_QUIRK2_HOST_OFF_CARD_ON for cygnus
mmc: sdhci-iproc: fix 32bit writes for TRANSFER_MODE register
mmc: sdhci-iproc: remove hard coded mmc cap 1.8v
mmc: block: propagate correct returned value in mmc_rpmb_ioctl
Linus Torvalds [Fri, 25 May 2018 16:15:13 +0000 (09:15 -0700)]
Merge tag 'drm-fixes-for-v4.17-rc7' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Only two sets of drivers fixes: one rcar-du lvds regression fix, and a
group of fixes for vmwgfx"
* tag 'drm-fixes-for-v4.17-rc7' of git://people.freedesktop.org/~airlied/linux:
drm/vmwgfx: Schedule an fb dirty update after resume
drm/vmwgfx: Fix host logging / guestinfo reading error paths
drm/vmwgfx: Fix 32-bit VMW_PORT_HB_[IN|OUT] macros
drm: rcar-du: lvds: Fix crash in .atomic_check when disabling connector
Linus Torvalds [Fri, 25 May 2018 16:13:34 +0000 (09:13 -0700)]
Merge tag 'sound-4.17-rc7' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Two fixes:
- a timer pause event notification was garbled upon the recent
hardening work; corrected now
- HD-audio runtime PM regression fix due to the incorrect return
type"
* tag 'sound-4.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - Fix runtime PM
ALSA: timer: Fix pause event notification
Qing Huang [Wed, 23 May 2018 23:22:46 +0000 (16:22 -0700)]
mlx4_core: allocate ICM memory in page size chunks
When a system is under memory presure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.
When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.
Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...
With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.
E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.
The solution is to use kvzalloc to replace kcalloc which will fall back
to vmalloc automatically if kmalloc fails.
Signed-off-by: Qing Huang <qing.huang@oracle.com>
Acked-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Govindarajulu Varadarajan [Wed, 23 May 2018 18:17:39 +0000 (11:17 -0700)]
enic: set DMA mask to 47 bit
In commit
624dbf55a359b ("driver/net: enic: Try DMA 64 first, then
failover to DMA") DMA mask was changed from 40 bits to 64 bits.
Hardware actually supports only 47 bits.
Fixes: 624dbf55a359b ("driver/net: enic: Try DMA 64 first, then failover to DMA")
Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Biggers [Wed, 23 May 2018 21:37:38 +0000 (14:37 -0700)]
ppp: remove the PPPIOCDETACH ioctl
The PPPIOCDETACH ioctl effectively tries to "close" the given ppp file
before f_count has reached 0, which is fundamentally a bad idea. It
does check 'f_count < 2', which excludes concurrent operations on the
file since they would only be possible with a shared fd table, in which
case each fdget() would take a file reference. However, it fails to
account for the fact that even with 'f_count == 1' the file can still be
linked into epoll instances. As reported by syzbot, this can trivially
be used to cause a use-after-free.
Yet, the only known user of PPPIOCDETACH is pppd versions older than
ppp-2.4.2, which was released almost 15 years ago (November 2003).
Also, PPPIOCDETACH apparently stopped working reliably at around the
same time, when the f_count check was added to the kernel, e.g. see
https://lkml.org/lkml/2002/12/31/83. Also, the current 'f_count < 2'
check makes PPPIOCDETACH only work in single-threaded applications; it
always fails if called from a multithreaded application.
All pppd versions released in the last 15 years just close() the file
descriptor instead.
Therefore, instead of hacking around this bug by exporting epoll
internals to modules, and probably missing other related bugs, just
remove the PPPIOCDETACH ioctl and see if anyone actually notices. Leave
a stub in place that prints a one-time warning and returns EINVAL.
Reported-by: syzbot+16363c99d4134717c05b@syzkaller.appspotmail.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Guillaume Nault <g.nault@alphalink.fr>
Tested-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Wed, 23 May 2018 18:29:52 +0000 (14:29 -0400)]
ipv4: remove warning in ip_recv_error
A precondition check in ip_recv_error triggered on an otherwise benign
race. Remove the warning.
The warning triggers when passing an ipv6 socket to this ipv4 error
handling function. RaceFuzzer was able to trigger it due to a race
in setsockopt IPV6_ADDRFORM.
---
CPU0
do_ipv6_setsockopt
sk->sk_socket->ops = &inet_dgram_ops;
---
CPU1
sk->sk_prot->recvmsg
udp_recvmsg
ip_recv_error
WARN_ON_ONCE(sk->sk_family == AF_INET6);
---
CPU0
do_ipv6_setsockopt
sk->sk_family = PF_INET;
This socket option converts a v6 socket that is connected to a v4 peer
to an v4 socket. It updates the socket on the fly, changing fields in
sk as well as other structs. This is inherently non-atomic. It races
with the lockless udp_recvmsg path.
No other code makes an assumption that these fields are updated
atomically. It is benign here, too, as ip_recv_error cares only about
the protocol of the skbs enqueued on the error queue, for which
sk_family is not a precise predictor (thanks to another isue with
IPV6_ADDRFORM).
Link: http://lkml.kernel.org/r/20180518120826.GA19515@dragonet.kaist.ac.kr
Fixes: 7ce875e5ecb8 ("ipv4: warn once on passing AF_INET6 socket to ip_recv_error")
Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Wed, 23 May 2018 16:24:48 +0000 (19:24 +0300)]
net : sched: cls_api: deal with egdev path only if needed
When dealing with ingress rule on a netdev, if we did fine through the
conventional path, there's no need to continue into the egdev route,
and we can stop right there.
Not doing so may cause a 2nd rule to be added by the cls api layer
with the ingress being the egdev.
For example, under sriov switchdev scheme, a user rule of VFR A --> VFR B
will end up with two HW rules (1) VF A --> VF B and (2) uplink --> VF B
Fixes: 208c0f4b5237 ('net: sched: use tc_setup_cb_call to call per-block callbacks')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Tue, 22 May 2018 11:58:57 +0000 (19:58 +0800)]
vhost: synchronize IOTLB message with dev cleanup
DaeRyong Jeong reports a race between vhost_dev_cleanup() and
vhost_process_iotlb_msg():
Thread interleaving:
CPU0 (vhost_process_iotlb_msg) CPU1 (vhost_dev_cleanup)
(In the case of both VHOST_IOTLB_UPDATE and
VHOST_IOTLB_INVALIDATE)
===== =====
vhost_umem_clean(dev->iotlb);
if (!dev->iotlb) {
ret = -EFAULT;
break;
}
dev->iotlb = NULL;
The reason is we don't synchronize between them, fixing by protecting
vhost_process_iotlb_msg() with dev mutex.
Reported-by: DaeRyong Jeong <threeearcat@gmail.com>
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 25 May 2018 02:01:06 +0000 (22:01 -0400)]
Merge tag 'mlx5-fixes-2018-05-24' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Mellanox, mlx5 fixes 2018-05-24
This series includes two mlx5 fixes.
1) add FCS data to checksum complete when required, from Eran Ben
Elisha.
2) Fix A race in IPSec sandbox QP commands, from Yossi Kuperman.
Please pull and let me know if there's any problem.
for -stable v4.15
("net/mlx5e: When RXFCS is set, add FCS data into checksum calculation")
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Thu, 24 May 2018 22:10:30 +0000 (18:10 -0400)]
packet: fix reserve calculation
Commit
b84bbaf7a6c8 ("packet: in packet_snd start writing at link
layer allocation") ensures that packet_snd always starts writing
the link layer header in reserved headroom allocated for this
purpose.
This is needed because packets may be shorter than hard_header_len,
in which case the space up to hard_header_len may be zeroed. But
that necessary padding is not accounted for in skb->len.
The fix, however, is buggy. It calls skb_push, which grows skb->len
when moving skb->data back. But in this case packet length should not
change.
Instead, call skb_reserve, which moves both skb->data and skb->tail
back, without changing length.
Fixes: b84bbaf7a6c8 ("packet: in packet_snd start writing at link layer allocation")
Reported-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Airlie [Thu, 24 May 2018 23:47:56 +0000 (09:47 +1000)]
Merge branch 'vmwgfx-fixes-4.17' of git://people.freedesktop.org/~thomash/linux into drm-fixes
Three fixes for vmwgfx. Two are cc'd stable and fix host logging and its
error paths on 32-bit VMs. One is a fix for a hibernate flaw
introduced with the 4.17 merge window.
* 'vmwgfx-fixes-4.17' of git://people.freedesktop.org/~thomash/linux:
drm/vmwgfx: Schedule an fb dirty update after resume
drm/vmwgfx: Fix host logging / guestinfo reading error paths
drm/vmwgfx: Fix 32-bit VMW_PORT_HB_[IN|OUT] macros
Linus Torvalds [Thu, 24 May 2018 21:42:43 +0000 (14:42 -0700)]
Merge branch 'stable/for-linus-4.17' of git://git./linux/kernel/git/konrad/swiotlb
Pull swiotlb fix from Konrad Rzeszutek Wilk:
"One single fix in here: under Xen the DMA32 heap (in the hypervisor)
would end up looking like swiss cheese.
The reason being that for every coherent DMA allocation we didn't do
the proper hypercall to tell Xen to return the page back to the DMA32
heap. End result was (eventually) no DMA32 space if you (for example)
continously unloaded and loaded modules"
* 'stable/for-linus-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent
Yossi Kuperman [Tue, 17 Oct 2017 17:39:17 +0000 (20:39 +0300)]
net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands
Sandbox QP Commands are retired in the order they are sent. Outstanding
commands are stored in a linked-list in the order they appear. Once a
response is received and the callback gets called, we pull the first
element off the pending list, assuming they correspond.
Sending a message and adding it to the pending list is not done atomically,
hence there is an opportunity for a race between concurrent requests.
Bind both send and add under a critical section.
Fixes: bebb23e6cb02 ("net/mlx5: Accel, Add IPSec acceleration interface")
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Adi Nissim <adin@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eran Ben Elisha [Tue, 1 May 2018 13:25:07 +0000 (16:25 +0300)]
net/mlx5e: When RXFCS is set, add FCS data into checksum calculation
When RXFCS feature is enabled, the HW do not strip the FCS data,
however it is not present in the checksum calculated by the HW.
Fix that by manually calculating the FCS checksum and adding it to the SKB
checksum field.
Add helper function to find the FCS data for all SKB forms (linear,
one fragment or more).
Fixes: 102722fc6832 ("net/mlx5e: Add support for RXFCS feature flag")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Linus Torvalds [Thu, 24 May 2018 21:12:05 +0000 (14:12 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"This is pretty much just the usual array of smallish driver bugs.
- remove bouncing addresses from the MAINTAINERS file
- kernel oops and bad error handling fixes for hfi, i40iw, cxgb4, and
hns drivers
- various small LOC behavioral/operational bugs in mlx5, hns, qedr
and i40iw drivers
- two fixes for patches already sent during the merge window
- a long-standing bug related to not decreasing the pinned pages
count in the right MM was found and fixed"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (28 commits)
RDMA/hns: Move the location for initializing tmp_len
RDMA/hns: Bugfix for cq record db for kernel
IB/uverbs: Fix uverbs_attr_get_obj
RDMA/qedr: Fix doorbell bar mapping for dpi > 1
IB/umem: Use the correct mm during ib_umem_release
iw_cxgb4: Fix an error handling path in 'c4iw_get_dma_mr()'
RDMA/i40iw: Avoid panic when reading back the IRQ affinity hint
RDMA/i40iw: Avoid reference leaks when processing the AEQ
RDMA/i40iw: Avoid panic when objects are being created and destroyed
RDMA/hns: Fix the bug with NULL pointer
RDMA/hns: Set NULL for __internal_mr
RDMA/hns: Enable inner_pa_vld filed of mpt
RDMA/hns: Set desc_dma_addr for zero when free cmq desc
RDMA/hns: Fix the bug with rq sge
RDMA/hns: Not support qp transition from reset to reset for hip06
RDMA/hns: Add return operation when configured global param fail
RDMA/hns: Update convert function of endian format
RDMA/hns: Load the RoCE dirver automatically
RDMA/hns: Bugfix for rq record db for kernel
RDMA/hns: Add rq inline flags judgement
...
Linus Torvalds [Thu, 24 May 2018 18:47:43 +0000 (11:47 -0700)]
Merge tag 'for-4.17-rc6-tag' of git://git./linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A one-liner that prevents leaking an internal error value 1 out of the
ftruncate syscall.
This has been observed in practice. The steps to reproduce make a
common pattern (open/write/fync/ftruncate) but also need the
application to not check only for negative values and happens only for
compressed inlined files.
The conditions are narrow but as this could break userspace I think
it's better to merge it now and not wait for the merge window"
* tag 'for-4.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix error handling in btrfs_truncate()
Lukas Wunner [Thu, 24 May 2018 17:01:07 +0000 (19:01 +0200)]
ALSA: hda - Fix runtime PM
Before commit
3b5b899ca67d ("ALSA: hda: Make use of core codec functions
to sync power state"), hda_set_power_state() returned the response to
the Get Power State verb, a 32-bit unsigned integer whose expected value
is 0x233 after transitioning a codec to D3, and 0x0 after transitioning
it to D0.
The response value is significant because hda_codec_runtime_suspend()
does not clear the codec's bit in the codec_powered bitmask unless the
AC_PWRST_CLK_STOP_OK bit (0x200) is set in the response value. That in
turn prevents the HDA controller from runtime suspending because
azx_runtime_idle() checks that the codec_powered bitmask is zero.
Since commit
3b5b899ca67d, hda_set_power_state() only returns 0x0 or
0x1, thereby breaking runtime PM for any HDA controller. That's because
an inline function introduced by the commit returns a bool instead of a
32-bit unsigned int. The change was likely erroneous and resulted from
copying and pasting snd_hda_check_power_state(), which is immediately
preceding the newly introduced inline function. Fix it.
Link: https://bugs.freedesktop.org/show_bug.cgi?id=106597
Fixes: 3b5b899ca67d ("ALSA: hda: Make use of core codec functions to sync power state")
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Abhijeet Kumar <abhijeet.kumar@intel.com>
Reported-and-tested-by: Gunnar Krüger <taijian@posteo.de>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Joonsoo Kim [Wed, 23 May 2018 01:18:21 +0000 (10:18 +0900)]
Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE"
This reverts the following commits that change CMA design in MM.
3d2054ad8c2d ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")
1d47a3ec09b5 ("mm/cma: remove ALLOC_CMA")
bad8c6c0b114 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
Ville reported a following error on i386.
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
microcode: microcode updated early to revision 0x4, date = 2013-06-28
Initializing CPU#0
Initializing HighMem for node 0 (
000377fe:
00118000)
Initializing Movable for node 0 (
00000001:
00118000)
BUG: Bad page state in process swapper pfn:377fe
page:
f53effc0 count:0 mapcount:-127 mapping:
00000000 index:0x0
flags: 0x80000000()
raw:
80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
page dumped because: nonzero mapcount
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
Call Trace:
dump_stack+0x60/0x96
bad_page+0x9a/0x100
free_pages_check_bad+0x3f/0x60
free_pcppages_bulk+0x29d/0x5b0
free_unref_page_commit+0x84/0xb0
free_unref_page+0x3e/0x70
__free_pages+0x1d/0x20
free_highmem_page+0x19/0x40
add_highpages_with_active_regions+0xab/0xeb
set_highmem_pages_init+0x66/0x73
mem_init+0x1b/0x1d7
start_kernel+0x17a/0x363
i386_start_kernel+0x95/0x99
startup_32_smp+0x164/0x168
The reason for this error is that the span of MOVABLE_ZONE is extended
to whole node span for future CMA initialization, and, normal memory is
wrongly freed here. I submitted the fix and it seems to work, but,
another problem happened.
It's so late time to fix the later problem so I decide to reverting the
series.
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 24 May 2018 16:36:16 +0000 (09:36 -0700)]
Merge branch 'for-4.17-fixes' of git://git./linux/kernel/git/tj/libata
Pull libata fixes from Tejun Heo:
"Nothing too interesting. Four patches to update the blacklist and
add a controller ID"
* 'for-4.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
ahci: Add PCI ID for Cannon Lake PCH-LP AHCI
libata: blacklist Micron 500IT SSD with MU01 firmware
libata: Apply NOLPM quirk for SAMSUNG PM830 CXM13D1Q.
libata: Blacklist some Sandisk SSDs for NCQ
Linus Torvalds [Thu, 24 May 2018 15:53:20 +0000 (08:53 -0700)]
Merge tag 'for-linus-
20180524' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Two fixes that should go into this release:
- a loop writeback error clearing fix from Jeff
- the sr sense fix from myself"
* tag 'for-linus-
20180524' of git://git.kernel.dk/linux-block:
loop: clear wb_err in bd_inode when detaching backing file
sr: pass down correctly sized SCSI sense buffer
Linus Torvalds [Thu, 24 May 2018 15:49:56 +0000 (08:49 -0700)]
Merge tag 'pm-4.17-rc7' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix a regression from the 4.15 cycle that caused the system suspend
and resume overhead to increase on many systems and triggered more
serious problems on some of them (Rafael Wysocki)"
* tag 'pm-4.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / core: Fix direct_complete handling for devices with no callbacks
Daniel Borkmann [Thu, 24 May 2018 00:32:53 +0000 (02:32 +0200)]
bpf: properly enforce index mask to prevent out-of-bounds speculation
While reviewing the verifier code, I recently noticed that the
following two program variants in relation to tail calls can be
loaded.
Variant 1:
# bpftool p d x i 15
0: (15) if r1 == 0x0 goto pc+3
1: (18) r2 = map[id:5]
3: (05) goto pc+2
4: (18) r2 = map[id:6]
6: (b7) r3 = 7
7: (35) if r3 >= 0xa0 goto pc+2
8: (54) (u32) r3 &= (u32) 255
9: (85) call bpf_tail_call#12
10: (b7) r0 = 1
11: (95) exit
# bpftool m s i 5
5: prog_array flags 0x0
key 4B value 4B max_entries 4 memlock 4096B
# bpftool m s i 6
6: prog_array flags 0x0
key 4B value 4B max_entries 160 memlock 4096B
Variant 2:
# bpftool p d x i 20
0: (15) if r1 == 0x0 goto pc+3
1: (18) r2 = map[id:8]
3: (05) goto pc+2
4: (18) r2 = map[id:7]
6: (b7) r3 = 7
7: (35) if r3 >= 0x4 goto pc+2
8: (54) (u32) r3 &= (u32) 3
9: (85) call bpf_tail_call#12
10: (b7) r0 = 1
11: (95) exit
# bpftool m s i 8
8: prog_array flags 0x0
key 4B value 4B max_entries 160 memlock 4096B
# bpftool m s i 7
7: prog_array flags 0x0
key 4B value 4B max_entries 4 memlock 4096B
In both cases the index masking inserted by the verifier in order
to control out of bounds speculation from a CPU via
b2157399cc98
("bpf: prevent out-of-bounds speculation") seems to be incorrect
in what it is enforcing. In the 1st variant, the mask is applied
from the map with the significantly larger number of entries where
we would allow to a certain degree out of bounds speculation for
the smaller map, and in the 2nd variant where the mask is applied
from the map with the smaller number of entries, we get buggy
behavior since we truncate the index of the larger map.
The original intent from commit
b2157399cc98 is to reject such
occasions where two or more different tail call maps are used
in the same tail call helper invocation. However, the check on
the BPF_MAP_PTR_POISON is never hit since we never poisoned the
saved pointer in the first place! We do this explicitly for map
lookups but in case of tail calls we basically used the tail
call map in insn_aux_data that was processed in the most recent
path which the verifier walked. Thus any prior path that stored
a pointer in insn_aux_data at the helper location was always
overridden.
Fix it by moving the map pointer poison logic into a small helper
that covers both BPF helpers with the same logic. After that in
fixup_bpf_calls() the poison check is then hit for tail calls
and the program rejected. Latter only happens in unprivileged
case since this is the *only* occasion where a rewrite needs to
happen, and where such rewrite is specific to the map (max_entries,
index_mask). In the privileged case the rewrite is generic for
the insn->imm / insn->code update so multiple maps from different
paths can be handled just fine since all the remaining logic
happens in the instruction processing itself. This is similar
to the case of map lookups: in case there is a collision of
maps in fixup_bpf_calls() we must skip the inlined rewrite since
this will turn the generic instruction sequence into a non-
generic one. Thus the patch_call_imm will simply update the
insn->imm location where the bpf_map_lookup_elem() will later
take care of the dispatch. Given we need this 'poison' state
as a check, the information of whether a map is an unpriv_array
gets lost, so enforcing it prior to that needs an additional
state. In general this check is needed since there are some
complex and tail call intensive BPF programs out there where
LLVM tends to generate such code occasionally. We therefore
convert the map_ptr rather into map_state to store all this
w/o extra memory overhead, and the bit whether one of the maps
involved in the collision was from an unpriv_array thus needs
to be retained as well there.
Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Mika Westerberg [Thu, 24 May 2018 08:12:16 +0000 (11:12 +0300)]
ahci: Add PCI ID for Cannon Lake PCH-LP AHCI
This one should be using the default LPM policy for mobile chipsets so
add the PCI ID to the driver list of supported revices.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Laura Abbott [Wed, 23 May 2018 18:43:46 +0000 (11:43 -0700)]
arm64: Make sure permission updates happen for pmd/pud
Commit
15122ee2c515 ("arm64: Enforce BBM for huge IO/VMAP mappings")
disallowed block mappings for ioremap since that code does not honor
break-before-make. The same APIs are also used for permission updating
though and the extra checks prevent the permission updates from happening,
even though this should be permitted. This results in read-only permissions
not being fully applied. Visibly, this can occasionaly be seen as a failure
on the built in rodata test when the test data ends up in a section or
as an odd RW gap on the page table dump. Fix this by using
pgattr_change_is_safe instead of p*d_present for determining if the
change is permitted.
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Peter Robinson <pbrobinson@gmail.com>
Reported-by: Peter Robinson <pbrobinson@gmail.com>
Fixes: 15122ee2c515 ("arm64: Enforce BBM for huge IO/VMAP mappings")
Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Omar Sandoval [Tue, 22 May 2018 16:47:58 +0000 (09:47 -0700)]
Btrfs: fix error handling in btrfs_truncate()
Jun Wu at Facebook reported that an internal service was seeing a return
value of 1 from ftruncate() on Btrfs in some cases. This is coming from
the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
btrfs_truncate() uses two variables for error handling, ret and err.
When btrfs_truncate_inode_items() returns non-zero, we set err to the
return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
only set err if ret is an error (i.e., negative).
To reproduce the issue: mount a filesystem with -o compress-force=zstd
and the following program will encounter return value of 1 from
ftruncate:
int main(void) {
char buf[256] = { 0 };
int ret;
int fd;
fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
if (fd == -1) {
perror("open");
return EXIT_FAILURE;
}
if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
perror("write");
close(fd);
return EXIT_FAILURE;
}
if (fsync(fd) == -1) {
perror("fsync");
close(fd);
return EXIT_FAILURE;
}
ret = ftruncate(fd, 128);
if (ret) {
printf("ftruncate() returned %d\n", ret);
close(fd);
return EXIT_FAILURE;
}
close(fd);
return EXIT_SUCCESS;
}
Fixes: ddfae63cc8e0 ("btrfs: move btrfs_truncate_block out of trans handle")
CC: stable@vger.kernel.org # 4.15+
Reported-by: Jun Wu <quark@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Eric Dumazet [Mon, 21 May 2018 23:35:24 +0000 (16:35 -0700)]
netfilter: provide correct argument to nla_strlcpy()
Recent patch forgot to remove nla_data(), upsetting syzkaller a bit.
BUG: KASAN: slab-out-of-bounds in nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
Read of size 1 at addr
ffff8801ad1f4fdd by task syz-executor189/4509
CPU: 1 PID: 4509 Comm: syz-executor189 Not tainted 4.17.0-rc6+ #62
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
print_address_description+0x6c/0x20b mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
__asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
nfnl_acct_new+0x574/0xc50 net/netfilter/nfnetlink_acct.c:118
nfnetlink_rcv_msg+0xdb5/0xff0 net/netfilter/nfnetlink.c:212
netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
nfnetlink_rcv+0x1fe/0x1ba0 net/netfilter/nfnetlink.c:513
netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
sock_sendmsg_nosec net/socket.c:629 [inline]
sock_sendmsg+0xd5/0x120 net/socket.c:639
sock_write_iter+0x35a/0x5a0 net/socket.c:908
call_write_iter include/linux/fs.h:1784 [inline]
new_sync_write fs/read_write.c:474 [inline]
__vfs_write+0x64d/0x960 fs/read_write.c:487
vfs_write+0x1f8/0x560 fs/read_write.c:549
ksys_write+0xf9/0x250 fs/read_write.c:598
__do_sys_write fs/read_write.c:610 [inline]
__se_sys_write fs/read_write.c:607 [inline]
__x64_sys_write+0x73/0xb0 fs/read_write.c:607
Fixes: 4e09fc873d92 ("netfilter: prefer nla_strlcpy for dealing with NLA_STRING attributes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
oulijun [Tue, 22 May 2018 12:47:15 +0000 (20:47 +0800)]
RDMA/hns: Move the location for initializing tmp_len
When posted work request, it need to compute the length of
all sges of every wr and fill it into the msg_len field of
send wqe. Thus, While posting multiple wr,
tmp_len should be reinitialized to zero.
Fixes: 8b9b8d143b46 ("RDMA/hns: Fix the endian problem for hns")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
oulijun [Tue, 22 May 2018 12:47:14 +0000 (20:47 +0800)]
RDMA/hns: Bugfix for cq record db for kernel
When use cq record db for kernel, it needs to set the hr_cq->db_en
to 1 and configure the dma address of record cq db of qp context.
Fixes: 86188a8810ed ("RDMA/hns: Support cq record doorbell for kernel space")
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 22 May 2018 21:56:51 +0000 (15:56 -0600)]
IB/uverbs: Fix uverbs_attr_get_obj
The err pointer comes from uverbs_attr_get, not from the uobject member,
which does not store an ERR_PTR.
Fixes: be934cca9e98 ("IB/uverbs: Add device memory registration ioctl support")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Kalderon, Michal [Tue, 15 May 2018 12:13:33 +0000 (15:13 +0300)]
RDMA/qedr: Fix doorbell bar mapping for dpi > 1
Each user_context receives a separate dpi value and thus a different
address on the doorbell bar. The qedr_mmap function needs to validate
the address and map the doorbell bar accordingly.
The current implementation always checked against dpi=0 doorbell range
leading to a wrong mapping for doorbell bar. (It entered an else case
that mapped the address differently). qedr_mmap should only be used
for doorbells, so the else was actually wrong in the first place.
This only has an affect on arm architecture and not an issue on a
x86 based architecture.
This lead to doorbells not occurring on arm based systems and left
applications that use more than one dpi (or several applications
run simultaneously ) to hang.
Fixes: ac1b36e55a51 ("qedr: Add support for user context verbs")
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jack Morgenstein [Wed, 23 May 2018 07:41:59 +0000 (10:41 +0300)]
net/mlx4: Fix irq-unsafe spinlock usage
spin_lock/unlock was used instead of spin_un/lock_irq
in a procedure used in process space, on a spinlock
which can be grabbed in an interrupt.
This caused the stack trace below to be displayed (on kernel
4.17.0-rc1 compiled with Lock Debugging enabled):
[ 154.661474] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[ 154.668909] 4.17.0-rc1-rdma_rc_mlx+ #3 Tainted: G I
[ 154.675856] -----------------------------------------------------
[ 154.682706] modprobe/10159 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 154.690254]
00000000f3b0e495 (&(&qp_table->lock)->rlock){+.+.}, at: mlx4_qp_remove+0x20/0x50 [mlx4_core]
[ 154.700927]
and this task is already holding:
[ 154.707461]
0000000094373b5d (&(&cq->lock)->rlock/1){....}, at: destroy_qp_common+0x111/0x560 [mlx4_ib]
[ 154.718028] which would create a new lock dependency:
[ 154.723705] (&(&cq->lock)->rlock/1){....} -> (&(&qp_table->lock)->rlock){+.+.}
[ 154.731922]
but this new dependency connects a SOFTIRQ-irq-safe lock:
[ 154.740798] (&(&cq->lock)->rlock){..-.}
[ 154.740800]
... which became SOFTIRQ-irq-safe at:
[ 154.752163] _raw_spin_lock_irqsave+0x3e/0x50
[ 154.757163] mlx4_ib_poll_cq+0x36/0x900 [mlx4_ib]
[ 154.762554] ipoib_tx_poll+0x4a/0xf0 [ib_ipoib]
...
to a SOFTIRQ-irq-unsafe lock:
[ 154.815603] (&(&qp_table->lock)->rlock){+.+.}
[ 154.815604]
... which became SOFTIRQ-irq-unsafe at:
[ 154.827718] ...
[ 154.827720] _raw_spin_lock+0x35/0x50
[ 154.833912] mlx4_qp_lookup+0x1e/0x50 [mlx4_core]
[ 154.839302] mlx4_flow_attach+0x3f/0x3d0 [mlx4_core]
Since mlx4_qp_lookup() is called only in process space, we can
simply replace the spin_un/lock calls with spin_un/lock_irq calls.
Fixes: 6dc06c08bef1 ("net/mlx4: Fix the check in attaching steering rules")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Wed, 23 May 2018 00:04:49 +0000 (17:04 -0700)]
net: phy: broadcom: Fix bcm_write_exp()
On newer PHYs, we need to select the expansion register to write with
setting bits [11:8] to 0xf. This was done correctly by bcm7xxx.c prior
to being migrated to generic code under bcm-phy-lib.c which
unfortunately used the older implementation from the BCM54xx days.
Fix this by creating an inline stub: bcm_write_exp_sel() which adds the
correct value (MII_BCM54XX_EXP_SEL_ER) and update both the Cygnus PHY
and BCM7xxx PHY drivers which require setting these bits.
broadcom.c is unchanged because some PHYs even use a different selector
method, so let them specify it directly (e.g: SerDes secondary selector).
Fixes: a1cba5613edf ("net: phy: Add Broadcom phy library for common interfaces")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 22 May 2018 23:22:26 +0000 (16:22 -0700)]
net: phy: broadcom: Fix auxiliary control register reads
We are currently doing auxiliary control register reads with the shadow
register value 0b111 (0x7) which incidentally is also the selector value
that should be present in bits [2:0]. Fix this by using the appropriate
selector mask which is defined (MII_BCM54XX_AUXCTL_SHDWSEL_MASK).
This does not have a functional impact yet because we always access the
MII_BCM54XX_AUXCTL_SHDWSEL_MISC (0x7) register in the current code.
This might change at some point though.
Fixes: 5b4e29005123 ("net: phy: broadcom: add bcm54xx_auxctl_read")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Tue, 22 May 2018 20:44:51 +0000 (13:44 -0700)]
net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Tue, 22 May 2018 15:42:51 +0000 (16:42 +0100)]
net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message
Trivial fix to spelling mistake in mlx4_dbg debug message and also
change the phrasing of the message so that is is more readable
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nathan Fontenot [Tue, 22 May 2018 16:21:10 +0000 (11:21 -0500)]
ibmvnic: Only do H_EOI for mobility events
When enabling the sub-CRQ IRQ a previous update sent a H_EOI prior
to the enablement to clear any pending interrupts that may be present
across a partition migration. This fixed a firmware bug where a
migration could erroneously indicate that a H_EOI was pending.
The H_EOI should only be sent when enabling during a mobility
event though. Doing so at other time could wrong and can produce
extra driver output when IRQs are enabled when doing TX completion.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 23 May 2018 18:45:42 +0000 (14:45 -0400)]
Merge tag 'wireless-drivers-for-davem-2018-05-22' of git://git./linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
wireless-drivers fixes for 4.17
Hopefully the last fixes for 4.17. ssb is again causing problems so we
had to revert a commit and fix it better. Also a small fix to bcma and
some MAINTAINERS file updates.
ssb
* fix regression with all module PCI cards, for example using b43 and
b44 drivers
* try again fixing a MIPS linker error
bcma
* fix truncated info log messages
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Tue, 22 May 2018 06:21:04 +0000 (14:21 +0800)]
tuntap: correctly set SOCKWQ_ASYNC_NOSPACE
When link is down, writes to the device might fail with
-EIO. Userspace needs an indication when the status is resolved. As a
fix, tun_net_open() attempts to wake up writers - but that is only
effective if SOCKWQ_ASYNC_NOSPACE has been set in the past. This is
not the case of vhost_net which only poll for EPOLLOUT after it meets
errors during sendmsg().
This patch fixes this by making sure SOCKWQ_ASYNC_NOSPACE is set when
socket is not writable or device is down to guarantee EPOLLOUT will be
raised in either tun_chr_poll() or tun_sock_write_space() after device
is up.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Eric Dumazet <edumazet@google.com>
Fixes: 1bd4978a88ac2 ("tun: honor IFF_UP in tun_get_user()")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 23 May 2018 17:36:19 +0000 (13:36 -0400)]
Merge branch 'virtio_net-mergeable-XDP'
Jason Wang says:
====================
Fix several issues of virtio-net mergeable XDP
Please review the patches that tries to fix several issues of
virtio-net mergeable XDP.
Changes from V1:
- check against 1 before decreasing instead of resetting to 1
- typoe fixes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Tue, 22 May 2018 03:44:31 +0000 (11:44 +0800)]
virtio-net: fix leaking page for gso packet during mergeable XDP
We need to drop refcnt to xdp_page if we see a gso packet. Otherwise
it will be leaked. Fixing this by moving the check of gso packet above
the linearizing logic. While at it, remove useless comment as well.
Cc: John Fastabend <john.fastabend@gmail.com>
Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Tue, 22 May 2018 03:44:30 +0000 (11:44 +0800)]
virtio-net: correctly check num_buf during err path
If we successfully linearize the packet, num_buf will be set to zero
which may confuse error handling path which assumes num_buf is at
least 1 and this can lead the code tries to pop the descriptor of next
buffer. Fixing this by checking num_buf against 1 before decreasing.
Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Tue, 22 May 2018 03:44:29 +0000 (11:44 +0800)]
virtio-net: correctly transmit XDP buff after linearizing
We should not go for the error path after successfully transmitting a
XDP buffer after linearizing. Since the error path may try to pop and
drop next packet and increase the drop counters. Fixing this by simply
drop the refcnt of original page and go for xmit path.
Fixes: 72979a6c3590 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Tue, 22 May 2018 03:44:28 +0000 (11:44 +0800)]
virtio-net: correctly redirect linearized packet
After a linearized packet was redirected by XDP, we should not go for
the err path which will try to pop buffers for the next packet and
increase the drop counter. Fixing this by just drop the page refcnt
for the original page.
Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Reported-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 23 May 2018 15:50:05 +0000 (11:50 -0400)]
Merge tag 'mac80211-for-davem-2018-05-23' of git://git./linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A handful of fixes:
* hwsim radio dump wasn't working for the first radio
* mesh was updating statistics incorrectly
* a netlink message allocation was possibly too short
* wiphy name limit was still too long
* in certain cases regdb query could find a NULL pointer
====================
Signed-off-by: David S. Miller <davem@davemloft.net>