Trond Myklebust [Sun, 24 Jul 2016 21:08:59 +0000 (17:08 -0400)]
Merge branch 'pnfs'
Trond Myklebust [Sun, 24 Jul 2016 21:08:31 +0000 (17:08 -0400)]
Merge branch 'writeback'
Trond Myklebust [Sun, 24 Jul 2016 21:08:31 +0000 (17:08 -0400)]
Merge branch 'sunrpc'
Trond Myklebust [Sun, 24 Jul 2016 21:06:28 +0000 (17:06 -0400)]
SUNRPC: Fix a compiler warning in net/sunrpc/clnt.c
Fix the report:
net/sunrpc/clnt.c:2580:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 24 Jul 2016 19:14:44 +0000 (15:14 -0400)]
pNFS: Remove redundant smp_mb() from pnfs_init_lseg()
It's not visible yet, and won't be until after we grab the inode->i_lock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 24 Jul 2016 19:10:12 +0000 (15:10 -0400)]
pNFS: Cleanup - do layout segment initialisation in one place
...instead of splitting the initialisation over init_lseg() and
pnfs_layout_process().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 21 Jul 2016 18:45:19 +0000 (14:45 -0400)]
pNFS: Remove redundant stateid invalidation
The layout stateid will be invalidated once it holds no more layout
segments anyway.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 24 Jul 2016 16:45:47 +0000 (12:45 -0400)]
pNFS: Remove redundant pnfs_mark_layout_returned_if_empty()
That's already being taken care of in pnfs_layout_remove_lseg().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 24 Jul 2016 19:04:07 +0000 (15:04 -0400)]
pNFS: Clear the layout metadata if the server changed the layout stateid
If the server changed the layout stateid's "other" field, then
we should treat the old layout as being completely gone. In that
case, we want to clear the metadata such as scheduled layoutreturns.
Do this by calling pnfs_mark_layout_stateid_invalid().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Fri, 22 Jul 2016 15:25:27 +0000 (11:25 -0400)]
pNFS: Cleanup - don't open code pnfs_mark_layout_stateid_invalid()
Ensure that nfs42_layoutstat_done() and layoutget don't open-code layout
stateid invalidation.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Fri, 22 Jul 2016 15:13:22 +0000 (11:13 -0400)]
NFS: pnfs_mark_matching_lsegs_return() should match the layout sequence id
When determining which layout segments to return, we do want
pnfs_mark_matching_lsegs_return to check that they match the layout
sequence id. This ensures that we don't waste time if the server
is replaying a layout recall that has already been satisfied.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 21 Jul 2016 17:06:18 +0000 (13:06 -0400)]
pNFS: Do not set plh_return_seq for non-callback related layoutreturns
In cases where we need to send a layoutreturn in order to propagate
an error, we should not tie that to a specific layout stateid.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 21 Jul 2016 16:44:15 +0000 (12:44 -0400)]
pNFS: Ensure layoutreturn acts as a completion for layout callbacks
When we return NFS_OK to the CB_LAYOUTRECALL, we are required to
send a layoutreturn that "completes" that layout recall request, using
the correct stateid.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 24 Jul 2016 01:11:43 +0000 (21:11 -0400)]
pNFS: Fix CB_LAYOUTRECALL stateid verification
We want to evaluate in this order:
If the client holds no layout for this inode, then return
NFS4ERR_NOMATCHING_LAYOUT; it probably forgot the layout.
If the client finds the inode among the list of layouts, but the corresponding
stateid has not yet been initialised, then return NFS4ERR_DELAY to ask the
server to retry once the outstanding LAYOUTGET is complete.
If the current layout stateid's "other" field does not match the recalled
stateid, return NFS4ERR_BAD_STATEID.
If already processing a layout recall with a newer stateid, return
NFS4ERR_OLD_STATEID. This can only happen for servers that are
non-compliant with the NFSv4.1 protocol.
If already processing a layout recall with an older stateid, return
NFS4ERR_DELAY to ask the server to retry once the outstanding
LAYOUTRETURN is complete. Again, this is technically non-compliant with
the NFSv4.1 protocol.
If the current layout sequence id is newer than the recalled stateid's
sequence id, return NFS4ERR_OLD_STATEID. This too implies protocol
non-compliance.
If the recalled stateid's sequence id is newer than the current layout
sequence id + 1, return NFS4ERR_DELAY.
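The ordering above can be sketched as a single decision function. This is a simplified user-space illustration, not the kernel's implementation: the types `layout_state` and `layout_stateid`, the enum `recall_status` and the helper `seqid_is_newer()` are hypothetical, the wraparound-tolerant seqid comparison is an assumption, and a recall that exactly matches the one already in progress is assumed to be accepted. The last check treats a recalled seqid more than one ahead of the current seqid as a sign that a LAYOUTGET reply is still outstanding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical results standing in for the NFSv4.1 status codes. */
enum recall_status {
	RECALL_OK,                /* NFS_OK: go ahead and process the recall */
	RECALL_NOMATCHING_LAYOUT, /* NFS4ERR_NOMATCHING_LAYOUT */
	RECALL_DELAY,             /* NFS4ERR_DELAY */
	RECALL_BAD_STATEID,       /* NFS4ERR_BAD_STATEID */
	RECALL_OLD_STATEID,       /* NFS4ERR_OLD_STATEID */
};

struct layout_stateid {
	uint32_t seqid;
	unsigned char other[12];  /* the stateid "other" field */
};

/* Hypothetical per-inode layout state (not struct pnfs_layout_hdr). */
struct layout_state {
	bool present;             /* a layout exists for this inode */
	bool initialised;         /* the stateid was set by a LAYOUTGET reply */
	struct layout_stateid cur;
	bool recalling;           /* a layout recall is already being processed */
	uint32_t recall_seqid;    /* seqid of that recall */
};

/* True if a is newer than b, tolerating 32-bit wraparound (assumption). */
static bool seqid_is_newer(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

static enum recall_status check_recall(const struct layout_state *lo,
				       const struct layout_stateid *rc)
{
	assert(lo != NULL && rc != NULL);
	if (!lo->present)
		return RECALL_NOMATCHING_LAYOUT;
	if (!lo->initialised)
		return RECALL_DELAY;
	if (memcmp(lo->cur.other, rc->other, sizeof(rc->other)) != 0)
		return RECALL_BAD_STATEID;
	if (lo->recalling) {
		if (seqid_is_newer(lo->recall_seqid, rc->seqid))
			return RECALL_OLD_STATEID;  /* non-compliant server */
		if (seqid_is_newer(rc->seqid, lo->recall_seqid))
			return RECALL_DELAY;        /* wait for the LAYOUTRETURN */
		return RECALL_OK;                   /* same recall (assumption) */
	}
	if (seqid_is_newer(lo->cur.seqid, rc->seqid))
		return RECALL_OLD_STATEID;          /* non-compliant server */
	if (seqid_is_newer(rc->seqid, lo->cur.seqid + 1))
		return RECALL_DELAY;                /* LAYOUTGET still outstanding */
	return RECALL_OK;
}
```

With a current seqid of 5, a recall carrying seqid 6 is accepted, seqid 4 yields the OLD_STATEID result, and seqid 8 yields DELAY.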
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 24 Jul 2016 15:46:06 +0000 (11:46 -0400)]
pNFS: Always update the layout barrier seqid on LAYOUTGET
Currently, pnfs_set_layout_stateid() will update the layout sequence
id barrier only if the stateid itself is newer than the current
layout stateid. However in a situation where multiple LAYOUTGET calls
and a LAYOUTRETURN raced, it is entirely possible for one of the
LAYOUTGETs to set the current stateid to something newer than the
LAYOUTRETURN that needs to set the barrier.
The fix is to allow the "update_barrier" flag to force a check as to
whether or not the barrier needs to be updated.
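A minimal sketch of the idea, assuming a hypothetical `layout_hdr` that tracks only a stateid seqid and a barrier seqid: the barrier check is made independently of whether the stateid itself advanced, so a racing LAYOUTGET that already moved the stateid forward cannot stop a LAYOUTRETURN from raising the barrier. `set_layout_stateid()` here is an illustration, not the kernel's `pnfs_set_layout_stateid()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if a is newer than b, tolerating 32-bit wraparound. */
static bool seqid_is_newer(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Hypothetical, stripped-down layout header. */
struct layout_hdr {
	uint32_t seqid;    /* seqid of the current layout stateid */
	uint32_t barrier;  /* seqids not newer than this are treated as stale (assumption) */
};

static void set_layout_stateid(struct layout_hdr *lo, uint32_t new_seqid,
			       bool update_barrier)
{
	assert(lo != NULL);
	if (seqid_is_newer(new_seqid, lo->seqid))
		lo->seqid = new_seqid;
	/* Evaluated even when the stateid did not advance above. */
	if (update_barrier && seqid_is_newer(new_seqid, lo->barrier))
		lo->barrier = new_seqid;
}
```

For example, if a LAYOUTGET has already moved the stateid to seqid 7 and a LAYOUTRETURN then reports seqid 5, the stateid stays at 7 but the barrier still rises to 5.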
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 24 Jul 2016 15:39:03 +0000 (11:39 -0400)]
pNFS: Always update the layout stateid if NFS_LAYOUT_INVALID_STID is set
If the layout stateid is invalid, then pnfs_set_layout_stateid() must
always initialise it.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 21 Jul 2016 15:53:29 +0000 (11:53 -0400)]
pNFS: Clear the layout return tracking on layout reinitialisation
Ensure that we don't carry over layoutreturn info from a previous
incarnation of this layout.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 24 Jul 2016 16:26:34 +0000 (12:26 -0400)]
pNFS: LAYOUTRETURN should only update the stateid if the layout is valid
If the layout was completely returned, then ignore the returned layout
stateid.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 24 Jul 2016 16:51:10 +0000 (12:51 -0400)]
Merge commit 'e7bdea7750eb'
Needed in order to work on top of pNFS changes in Linus' upstream kernel.
Benjamin Coddington [Mon, 18 Jul 2016 14:41:57 +0000 (10:41 -0400)]
nfs: don't create zero-length requests
NFS doesn't expect requests with wb_bytes set to zero and may make
unexpected decisions about how to handle that request at the page IO layer.
Skip request creation if we won't have any wb_bytes in the request.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Artem Savkov [Thu, 21 Jul 2016 11:32:04 +0000 (13:32 +0200)]
Fix NULL pointer dereference in bl_free_device().
When bl_parse_deviceid() fails in bl_alloc_deviceid_node() at the
blkdev_get_by_*() step, we get a pnfs_block_dev struct that is
uninitialized except for the bdev field, which is set to whatever error
blkdev_get_by_*() returns. bl_free_device() then calls blkdev_put()
because bdev is not 0, resulting in a bad pointer dereference.
Fix this by setting bdev in struct pnfs_block_dev only if we didn't
get an error from blkdev_get_by_*().
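A user-space sketch of the fix pattern: the open result is stored only once it is known not to be an error pointer, so the free path can safely test for NULL. `struct blk_dev`, `blk_dev_set()` and `blk_dev_needs_put()` are hypothetical; the `ERR_PTR()`/`IS_ERR()`/`PTR_ERR()` definitions are simplified stand-ins for the kernel helpers of the same name.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
static long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }

struct block_device;  /* opaque here */

/* Hypothetical, trimmed-down version of struct pnfs_block_dev. */
struct blk_dev {
	struct block_device *bdev;  /* NULL unless the open succeeded */
};

/* Record the result of a blkdev_get_by_*()-style open. */
static int blk_dev_set(struct blk_dev *d, struct block_device *bdev)
{
	assert(d != NULL);
	if (IS_ERR(bdev))
		return (int)PTR_ERR(bdev);  /* d->bdev stays NULL */
	d->bdev = bdev;
	return 0;
}

/* Teardown only releases a real device, never an error pointer. */
static bool blk_dev_needs_put(const struct blk_dev *d)
{
	return d->bdev != NULL;
}
```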
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 21 Jul 2016 13:43:43 +0000 (09:43 -0400)]
pNFS/files: filelayout_write_done_cb must call nfs_writeback_update_inode()
All write callbacks are required to call nfs_writeback_update_inode() upon
success to ensure that file size changes are recorded, and the attribute
cache is invalidated.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Frank Sorenson [Fri, 8 Jul 2016 21:35:25 +0000 (16:35 -0500)]
sunrpc: Prevent resvport min/max inversion via sysfs and module parameter
The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.
Prevent inversion of min/max values when set through sysfs and
module parameter by setting the limits dependent on each other.
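A minimal sketch of mutually dependent limits, assuming hypothetical setters `set_min_resvport()` and `set_max_resvport()`, assumed absolute limits of 1 and 65535, and assumed defaults of 665 and 1023: each value is validated against the current value of the other, so a write that would invert the range is rejected with `-EINVAL`. It ignores the locking a real sysfs, sysctl or module-parameter handler would need.

```c
#include <assert.h>
#include <errno.h>

/* Assumed absolute limits and defaults, for illustration only. */
#define RESVPORT_MIN_LIMIT 1U
#define RESVPORT_MAX_LIMIT 65535U

static unsigned int min_resvport = 665;
static unsigned int max_resvport = 1023;

/* The lower bound may rise up to the current upper bound, not past it. */
static int set_min_resvport(unsigned int val)
{
	if (val < RESVPORT_MIN_LIMIT || val > max_resvport)
		return -EINVAL;
	min_resvport = val;
	assert(min_resvport <= max_resvport);
	return 0;
}

/* The upper bound may drop down to the current lower bound, not below it. */
static int set_max_resvport(unsigned int val)
{
	if (val < min_resvport || val > RESVPORT_MAX_LIMIT)
		return -EINVAL;
	max_resvport = val;
	assert(min_resvport <= max_resvport);
	return 0;
}
```

Note that min == max remains a legal setting, which is exactly the one-port case the reserved port range calculation has to handle.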
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Frank Sorenson [Fri, 8 Jul 2016 21:35:24 +0000 (16:35 -0500)]
sunrpc: Prevent resvport min/max inversion via sysctl
The current min/max resvport settings are independently limited
by the entire range of allowed ports, so max_resvport can be
set to a port lower than min_resvport.
Prevent inversion of min/max values when set through sysctl by
setting the limits dependent on each other.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Frank Sorenson [Fri, 8 Jul 2016 21:35:23 +0000 (16:35 -0500)]
sunrpc: Fix reserved port range calculation
The range calculation for choosing the random reserved port will panic
with divide-by-zero when min_resvport == max_resvport, a range of one
port, not zero.
Fix the reserved port range calculation by adding one to the difference.
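The off-by-one can be illustrated with a small user-space sketch; `random_resvport()` is a hypothetical helper, and `rand()` stands in for whatever random source the kernel actually uses.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Choose a random reserved port in the inclusive range [min, max].
 * The range holds (max - min + 1) ports; using (max - min) alone would
 * make the modulus zero when min == max, which is undefined behaviour
 * in C and a divide-by-zero in practice.
 */
static unsigned short random_resvport(unsigned short min, unsigned short max)
{
	unsigned int range;

	assert(min <= max);
	range = (unsigned int)max - min + 1;
	return (unsigned short)(min + (unsigned int)rand() % range);
}
```

With min == max == 800 the range is one port and the function always returns 800.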
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Frank Sorenson [Mon, 27 Jun 2016 19:17:19 +0000 (15:17 -0400)]
sunrpc: Fix bit count when setting hashtable size to power-of-two
Author: Frank Sorenson <sorenson@redhat.com>
Date: 2016-06-27 13:55:48 -0500
The hashtable size is incorrectly calculated as the next higher
power-of-two when being set to a power-of-two. fls() returns the
bit number of the most significant set bit, with the least
significant bit being numbered '1'. For a power-of-two, fls()
will return a bit number which is one higher than the number of bits
required, leading to a hashtable which is twice the requested size.
In addition, the value of (1 << nbits) will always be at least num,
so the test will never be true.
Fix the hash table size calculation to correctly set hashtable
size, and eliminate the unnecessary check.
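A user-space sketch contrasting the two calculations: `hashtbl_bits_old()` reconstructs the buggy logic as described above, and `hashtbl_bits()` shows one correct way to compute the bit count (an assumption, not quoted from the patch). `fls_u32()` reimplements the kernel's `fls()` semantics for illustration, and a 32-bit `unsigned int` is assumed.

```c
#include <assert.h>

/* User-space stand-in for the kernel's fls(): 1-based position of the
 * most significant set bit, or 0 when x == 0. */
static int fls_u32(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Original calculation: overshoots for powers of two; the check is dead code. */
static int hashtbl_bits_old(unsigned int num)
{
	int nbits;

	assert(num < (1U << 31));     /* keep the shift below well defined */
	nbits = fls_u32(num);
	if (num > (1U << nbits))      /* never true: 1 << fls(num) > num */
		nbits++;
	return nbits;
}

/* Corrected calculation: smallest nbits with (1 << nbits) >= num. */
static int hashtbl_bits(unsigned int num)
{
	return num ? fls_u32(num - 1) : 0;
}
```

For num = 1024 the old calculation yields 11 bits (a 2048-entry table), while the corrected one yields 10; for a non-power-of-two such as 1000 both yield 10.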
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tigran Mkrtchyan [Mon, 13 Jun 2016 18:52:00 +0000 (20:52 +0200)]
nfs4: flexfiles: respect noresvport when establishing connections to DSes
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tigran Mkrtchyan [Mon, 13 Jun 2016 17:57:35 +0000 (19:57 +0200)]
nfs4: clnt: respect noresvport when establishing connections to DSes
result:
$ mount -o vers=4.1 dcache-lab007:/ /pnfs
$ cp /etc/profile /pnfs
tcp 0 0 131.169.185.68:1005 131.169.191.141:32049 ESTABLISHED
tcp 0 0 131.169.185.68:751 131.169.191.144:2049 ESTABLISHED
$
$ mount -o vers=4.1,noresvport dcache-lab007:/ /pnfs
$ cp /etc/profile /pnfs
tcp 0 0 131.169.185.68:34894 131.169.191.141:32049 ESTABLISHED
tcp 0 0 131.169.185.68:35722 131.169.191.144:2049 ESTABLISHED
$
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Benjamin Coddington [Fri, 10 Jun 2016 20:37:35 +0000 (16:37 -0400)]
pnfs/blocklayout: put deviceid node after releasing bl_ext_lock
The last put of deviceid nodes for SCSI layouts may sleep, so we shouldn't
hold any spinlocks. Make sure we put them outside the bl_ext_lock.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Scott Mayhew [Tue, 7 Jun 2016 19:14:48 +0000 (15:14 -0400)]
sunrpc: move NO_CRKEY_TIMEOUT to the auth->au_flags
A generic_cred can be used to look up a unx_cred or a gss_cred, so it's
not really safe to use the generic_cred->acred->ac_flags to store
the NO_CRKEY_TIMEOUT flag. A lookup for a unx_cred triggered while the
KEY_EXPIRE_SOON flag is already set will cause both NO_CRKEY_TIMEOUT and
KEY_EXPIRE_SOON to be set in the ac_flags, leaving the user associated
with the auth_cred in a state where they're perpetually doing 4K
NFS_FILE_SYNC writes.
This can be reproduced as follows:
1. Mount two NFS filesystems, one with sec=krb5 and one with sec=sys.
They do not need to be the same export, nor do they even need to be from
the same NFS server. Also, v3 is fine.
$ sudo mount -o v3,sec=krb5 server1:/export /mnt/krb5
$ sudo mount -o v3,sec=sys server2:/export /mnt/sys
2. As the normal user, before accessing the kerberized mount, kinit with
a short lifetime (but not so short that renewing the ticket would leave
you within the 4-minute window again by the time the original ticket
expires), e.g.
$ kinit -l 10m -r 60m
3. Do some I/O to the kerberized mount and verify that the writes are
wsize, UNSTABLE:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1
4. Wait until you're within 4 minutes of key expiry, then do some more
I/O to the kerberized mount to ensure that RPC_CRED_KEY_EXPIRE_SOON gets
set. Verify that the writes are 4K, FILE_SYNC:
$ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1
5. Now do some I/O to the sec=sys mount. This will cause
RPC_CRED_NO_CRKEY_TIMEOUT to be set:
$ dd if=/dev/zero of=/mnt/sys/file bs=1M count=1
6. Writes for that user will now be permanently 4K, FILE_SYNC, regardless
of which mount is being written to, until you reboot the client. Renewing
the Kerberos ticket (assuming it hasn't already expired) will have no
effect. Grabbing a new Kerberos ticket at this point will have no effect
either.
Move the flag to the auth->au_flags field (which is currently unused)
and rename it slightly to reflect that it's no longer associated with
the auth_cred->ac_flags. Add the rpc_auth to the arg list of
rpcauth_cred_key_to_expire and check the au_flags there too. Finally,
add the inode to the arg list of nfs_ctx_key_to_expire so we can
determine the rpc_auth to pass to rpcauth_cred_key_to_expire.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Steve Dickson [Wed, 25 May 2016 14:36:50 +0000 (10:36 -0400)]
mount: use sec= that was specified on the command line
When older servers return RPC_AUTH_NULL, it means the
rpc creds will be ignored. In that case, use the sec=
that was specified instead of setting sec=null.
Fixes Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1112983
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 14 Jul 2016 19:14:02 +0000 (15:14 -0400)]
pNFS: Fix LAYOUTGET handling of NFS4ERR_BAD_STATEID and NFS4ERR_EXPIRED
We want to recover the open stateid if there is no layout stateid
and/or the stateid argument matches an open stateid.
Otherwise throw out the existing layout and recover from scratch, as
the layout stateid is bad.
Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Trond Myklebust [Thu, 14 Jul 2016 18:28:31 +0000 (14:28 -0400)]
pNFS: Handle NFS4ERR_RECALLCONFLICT correctly in LAYOUTGET
Instead of giving up altogether and falling back to doing I/O
through the MDS, which may make the situation worse, wait for
2 lease periods for the callback to resolve itself, and then
try destroying the existing layout.
Only if this was an attempt at getting a first layout do we
give up altogether, as the server is clearly crazy.
Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Trond Myklebust [Thu, 14 Jul 2016 22:46:24 +0000 (18:46 -0400)]
pNFS: Separate handling of NFS4ERR_LAYOUTTRYLATER and RECALLCONFLICT
They are not the same error, and need to be handled differently.
Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Trond Myklebust [Thu, 14 Jul 2016 22:34:12 +0000 (18:34 -0400)]
pNFS: Fix post-layoutget error handling in pnfs_update_layout()
The non-retry error path is currently broken and ends up releasing the
reference to the layout twice. It also can end up clearing the
NFS_LAYOUT_FIRST_LAYOUTGET flag twice, causing a race.
In addition, the retry path will fail to decrement the plh_outstanding
counter.
Fixes: 183d9e7b112aa ("pnfs: rework LAYOUTGET retry handling")
Cc: stable@vger.kernel.org # 4.7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Trond Myklebust [Mon, 18 Jul 2016 04:51:01 +0000 (00:51 -0400)]
pNFS: Don't mark the inode as revalidated if a LAYOUTCOMMIT is outstanding
We know that the attributes will need updating if there is still a
LAYOUTCOMMIT outstanding.
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sat, 16 Jul 2016 15:47:00 +0000 (11:47 -0400)]
SUNRPC: Fix infinite looping in rpc_clnt_iterate_for_each_xprt
If there were fewer than 2 entries in the multipath list, then
xprt_iter_next_entry_multiple() would never advance beyond the
first entry, which is correct for round robin behaviour, but not
for the list iteration.
The end result would be infinite looping in rpc_clnt_iterate_for_each_xprt()
as we would never see the xprt == NULL condition fulfilled.
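The difference between the two kinds of iterator can be sketched with a hypothetical array-backed list (`xprt_list`, `next_roundrobin()`, `next_each()` and `count_xprts()` are illustrative names, not the SUNRPC `xprt_iter` API): a round-robin step wraps forever, which is what load balancing wants, whereas a walk over every transport must eventually return NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical array-backed list of transport names. */
struct xprt_list {
	const char *const *xprts;
	size_t n;
};

/* Round robin: wraps at the end, so it never returns NULL on a
 * non-empty list. Fine for picking the next transport to use. */
static const char *next_roundrobin(const struct xprt_list *l, size_t *pos)
{
	const char *x;

	if (l->n == 0)
		return NULL;
	x = l->xprts[*pos % l->n];
	*pos = (*pos + 1) % l->n;
	return x;
}

/* Full walk: visits every entry once, then returns NULL. */
static const char *next_each(const struct xprt_list *l, size_t *pos)
{
	if (*pos >= l->n)
		return NULL;
	return l->xprts[(*pos)++];
}

/* Terminates only because next_each() eventually yields NULL; driving the
 * same loop with next_roundrobin() would spin forever. */
static size_t count_xprts(const struct xprt_list *l)
{
	size_t pos = 0, count = 0;

	assert(l != NULL);
	while (next_each(l, &pos) != NULL)
		count++;
	return count;
}
```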
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Fixes: 80b14d5e61ca ("SUNRPC: Add a structure to track multiple transports")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Kinglong Mee [Thu, 14 Jul 2016 04:02:01 +0000 (12:02 +0800)]
nfs/blocklayout: Check max uuids and devices before decoding
Check that the numbers of uuids/devices returned by the server do not
exceed the supported maximums before decoding them.
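A sketch of the defensive pattern, assuming a hypothetical fixed-size array with capacity `MAX_SIGS` and a simplified wire format of one count word followed by that many 32-bit values: the server-supplied count is checked against both the array capacity and the remaining buffer before anything is copied. The names and the capacity of 4 are illustrative; they are not taken from the blocklayout XDR code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SIGS 4  /* hypothetical array capacity */

struct sig_list {
	uint32_t nr;
	uint32_t sig[MAX_SIGS];
};

/* wire[0] holds the count, followed by that many values (host byte order). */
static int decode_sigs(struct sig_list *out, const uint32_t *wire, size_t nwords)
{
	uint32_t i, nr;

	assert(out != NULL && wire != NULL);
	if (nwords < 1)
		return -EIO;
	nr = wire[0];
	if (nr > MAX_SIGS)           /* would overflow out->sig[] */
		return -EIO;
	if (nr > nwords - 1)         /* would read past the buffer */
		return -EIO;
	for (i = 0; i < nr; i++)
		out->sig[i] = wire[1 + i];
	out->nr = nr;
	return 0;
}
```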
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Kinglong Mee [Thu, 14 Jul 2016 04:01:28 +0000 (12:01 +0800)]
nfs/blocklayout: Make sure calculate signature length aligned
Guard against a misbehaving NFS server returning an unaligned signature length.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Christoph Hellwig [Fri, 8 Jul 2016 09:41:30 +0000 (18:41 +0900)]
nfs/blocklayout: support RH/Fedora dm-mpath device nodes
Instead of reusing the wwn-* names for multipath device nodes, RHEL and
Fedora introduced new dm-mpath-uuid-* nodes with a slightly different
naming scheme. Try these names first to ensure we always get a
multipath-capable device if it exists.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Christoph Hellwig [Fri, 8 Jul 2016 09:41:29 +0000 (18:41 +0900)]
nfs/blocklayout: refactor open-by-wwn
The current code works with the standard udev/systemd names, but we'll have
to add another method in the next patch. Refactor it into a separate helper
to make room for the new variant.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Christoph Hellwig [Fri, 8 Jul 2016 09:41:28 +0000 (18:41 +0900)]
nfs/blocklayout: use proper fmode for opening block devices
This was fixed for the original block layout code a while ago, but also
needs to be fixed for the SCSI layout path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 14 Jul 2016 16:42:40 +0000 (12:42 -0400)]
NFSv4: Revert "Truncating file opens should also sync O_DIRECT writes"
We're not holding any locks, so both nfs_wb_all() and inode_dio_wait()
are unenforceable and have livelock potential. Just limit ourselves to
flushing out the data.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 23 Jun 2016 15:09:04 +0000 (11:09 -0400)]
NFS nfs_vm_page_mkwrite: Don't freeze me, Bro...
Prevent filesystem freezes while handling the write page fault.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sat, 25 Jun 2016 21:57:39 +0000 (17:57 -0400)]
NFSv4.2: llseek(SEEK_HOLE) and llseek(SEEK_DATA) don't require data sync
We want to ensure that we write the cached data to the server, but
don't require it to be synced to disk. If the server reboots, we will
get a stateid error, which will cause us to retry anyway.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sat, 25 Jun 2016 22:12:03 +0000 (18:12 -0400)]
NFSv4.2: Fix writeback races in nfs4_copy_file_range
We need to ensure that any writes to the destination file are serialised
with the copy, meaning that the writeback has to occur under the inode lock.
Also relax the writeback requirement on the source, and rely on the
stateid checking to tell us if the source rebooted. Add the helper
nfs_filemap_write_and_wait_range() to call pnfs_sync_inode() as
is appropriate for pNFS servers that may need a layoutcommit.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sat, 25 Jun 2016 21:50:53 +0000 (17:50 -0400)]
NFSv4.2: Fix a race in nfs42_proc_deallocate()
When punching holes in a file, we want to ensure the operation is
serialised w.r.t. other writes, meaning that we want to call
nfs_sync_inode() while holding the inode lock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sat, 25 Jun 2016 21:45:40 +0000 (17:45 -0400)]
NFS: Getattr doesn't require data sync semantics
When retrieving stat() information, NFS unfortunately does require us to
flush writes to the server in order to ensure that mtime and ctime are up to
date. However, we shouldn't have to ensure that those writes are persisted.
Relaxing that requirement does mean that we may see an mtime/ctime change
if the server reboots and forces us to replay all writes.
The exception to this rule is pNFS clients that are required to send
layoutcommit; however, that is dealt with by the call to pnfs_sync_inode()
in _nfs_revalidate_inode().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sat, 25 Jun 2016 21:24:46 +0000 (17:24 -0400)]
NFS: Do not aggressively cache file attributes in the case of O_DIRECT
A file that is open for O_DIRECT is by definition not obeying
close-to-open cache consistency semantics, so let's not cache
the attributes too aggressively either.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Wed, 22 Jun 2016 12:19:36 +0000 (08:19 -0400)]
NFS: Remove unused function nfs_revalidate_mapping_protected()
Clean up...
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 23 Jun 2016 13:55:48 +0000 (09:55 -0400)]
NFS: Remove redundant waits for O_DIRECT in fsync() and write_begin()
We're now waiting immediately after taking the locks, so waiting
in fsync() and write_begin() is either redundant or potentially
subject to livelock (if not holding the lock).
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 23 Jun 2016 13:29:47 +0000 (09:29 -0400)]
NFS: Cleanup nfs_direct_complete()
There is only one caller that sets the "write" argument to true,
so just move the call to nfs_zap_mapping() and get rid of the
now redundant argument.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Fri, 3 Jun 2016 21:07:19 +0000 (17:07 -0400)]
NFS: Do not serialise O_DIRECT reads and writes
Allow dio requests to be scheduled in parallel, while ensuring that they
do not conflict with buffered I/O.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 23 Jun 2016 19:00:42 +0000 (15:00 -0400)]
NFS: Move buffered I/O locking into nfs_file_write()
Preparation for the patch that de-serialises O_DIRECT reads and
writes.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 23 Jun 2016 14:35:48 +0000 (10:35 -0400)]
NFS Cleanup: move call to generic_write_checks() into fs/nfs/direct.c
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Wed, 22 Jun 2016 18:38:06 +0000 (14:38 -0400)]
NFS: Remove racy size manipulations in O_DIRECT
On success, the RPC callbacks will ensure that we make the appropriate calls
to nfs_writeback_update_inode().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 2 Jun 2016 01:42:32 +0000 (21:42 -0400)]
NFS: Ensure we reset the write verifier 'committed' value on resend.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Thu, 2 Jun 2016 01:32:24 +0000 (21:32 -0400)]
NFS: Fix O_DIRECT verifier problems
We should not be interested in looking at the value of the stable field,
since that could take any value.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Tue, 5 Jul 2016 23:08:58 +0000 (19:08 -0400)]
pNFS: pnfs_layoutcommit_outstanding() is no longer used when !CONFIG_NFS_V4_1
Cleanup...
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Tue, 5 Jul 2016 17:46:53 +0000 (13:46 -0400)]
pNFS: Ensure we layoutcommit before revalidating attributes
If we need to update the cached attributes, then we'd better make
sure that we also layoutcommit first. Otherwise, the server may have stale
attributes.
Prior to this patch, the revalidation code tried to "fix" this problem by
simply disabling attributes that would be affected by the layoutcommit.
That approach breaks nfs_writeback_check_extend(), leading to a file size
corruption.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 26 Jun 2016 22:54:58 +0000 (18:54 -0400)]
pNFS: Files and flexfiles always need to commit before layoutcommit
So ensure that we mark the layout for commit once the write is done,
and then ensure that the commit to the DS is finished before sending
layoutcommit.
Note that by doing this, we're able to optimise away the commit
for the case of servers that don't need layoutcommit in order to
return updated attributes.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 26 Jun 2016 20:14:40 +0000 (16:14 -0400)]
pNFS/flexfiles: Clean up calls to pnfs_set_layoutcommit()
Let's just have one place where we check ff_layout_need_layoutcommit().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 26 Jun 2016 16:39:49 +0000 (12:39 -0400)]
pNFS/flexfiles: Fix layoutcommit after a commit to DS
We should always do a layoutcommit after commit to DS, except if
the layout segment we're using has set FF_FLAGS_NO_LAYOUTCOMMIT.
Fixes: d67ae825a59d ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 26 Jun 2016 16:27:25 +0000 (12:27 -0400)]
pNFS/files: Fix layoutcommit after a commit to DS
According to the RFC 5661 erratum at
https://www.rfc-editor.org/errata_search.php?rfc=5661&eid=2751
we should always send a layoutcommit after a commit to the DS.
Fixes: bc7d4b8fd091 ("nfs/filelayout: set layoutcommit...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Sun, 26 Jun 2016 12:44:35 +0000 (08:44 -0400)]
NFSv4: Allow retry of operations that used a returned delegation stateid
Fix up nfs4_do_handle_exception() so that it can check if the operation
that received the NFS4ERR_BAD_STATEID was using a defunct delegation.
Apply that to the case of SETATTR, which will currently return EIO
in some cases where this happens.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Tue, 28 Jun 2016 17:54:09 +0000 (13:54 -0400)]
NFS/pnfs: Do not clobber existing pgio_done_cb in nfs4_proc_read_setup
If a pNFS client sets hdr->pgio_done_cb, then we should not overwrite that
in nfs4_proc_read_setup().
Fixes: 75bf47ebf6b5 ("pNFS/flexfile: Fix erroneous fall back to...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Trond Myklebust [Wed, 22 Jun 2016 18:13:12 +0000 (14:13 -0400)]
NFS: Fix an Oops in the pNFS files and flexfiles connection setup to the DS
Chris Worley reports:
RIP: 0010:[<ffffffffa0245f80>]  [<ffffffffa0245f80>] rpc_new_client+0x2a0/0x2e0 [sunrpc]
RSP: 0018:ffff880158f6f548  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880234f8bc00 RCX: 000000000000ea60
RDX: 0000000000074cc0 RSI: 000000000000ea60 RDI: ffff880234f8bcf0
RBP: ffff880158f6f588 R08: 000000000001ac80 R09: ffff880237003300
R10: ffff880201171000 R11: ffffea0000d75200 R12: ffffffffa03afc60
R13: ffff880230c18800 R14: 0000000000000000 R15: ffff880158f6f680
FS:  00007f0e32673740(0000) GS:ffff88023fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000234886000 CR4: 00000000001406e0
Stack:
ffffffffa047a680 0000000000000000 ffff880158f6f598 ffff880158f6f680
ffff880158f6f680 ffff880234d11d00 ffff88023357f800 ffff880158f6f7d0
ffff880158f6f5b8 ffffffffa024660a ffff880158f6f5b8 ffffffffa02492ec
Call Trace:
[<ffffffffa024660a>] rpc_create_xprt+0x1a/0xb0 [sunrpc]
[<ffffffffa02492ec>] ? xprt_create_transport+0x13c/0x240 [sunrpc]
[<ffffffffa0246766>] rpc_create+0xc6/0x1a0 [sunrpc]
[<ffffffffa038e695>] nfs_create_rpc_client+0xf5/0x140 [nfs]
[<ffffffffa038f31a>] nfs_init_client+0x3a/0xd0 [nfs]
[<ffffffffa038f22f>] nfs_get_client+0x25f/0x310 [nfs]
[<ffffffffa025cef8>] ? rpc_ntop+0xe8/0x100 [sunrpc]
[<ffffffffa047512c>] nfs3_set_ds_client+0xcc/0x100 [nfsv3]
[<ffffffffa041fa10>] nfs4_pnfs_ds_connect+0x120/0x400 [nfsv4]
[<ffffffffa03d41c7>] nfs4_ff_layout_prepare_ds+0xe7/0x330 [nfs_layout_flexfiles]
[<ffffffffa03d1b1b>] ff_layout_pg_init_write+0xcb/0x280 [nfs_layout_flexfiles]
[<ffffffffa03a14dc>] __nfs_pageio_add_request+0x12c/0x490 [nfs]
[<ffffffffa03a1fa2>] nfs_pageio_add_request+0xc2/0x2a0 [nfs]
[<ffffffffa03a0365>] ? nfs_pageio_init+0x75/0x120 [nfs]
[<ffffffffa03a5b50>] nfs_do_writepage+0x120/0x270 [nfs]
[<ffffffffa03a5d31>] nfs_writepage_locked+0x61/0xc0 [nfs]
[<ffffffff813d4115>] ? __percpu_counter_add+0x55/0x70
[<ffffffffa03a6a9f>] nfs_wb_single_page+0xef/0x1c0 [nfs]
[<ffffffff811ca4a3>] ? __dec_zone_page_state+0x33/0x40
[<ffffffffa0395b21>] nfs_launder_page+0x41/0x90 [nfs]
[<ffffffff811baba0>] invalidate_inode_pages2_range+0x340/0x3a0
[<ffffffff811bac17>] invalidate_inode_pages2+0x17/0x20
[<ffffffffa039960e>] nfs_release+0x9e/0xb0 [nfs]
[<ffffffffa0399570>] ? nfs_open+0x60/0x60 [nfs]
[<ffffffffa0394dad>] nfs_file_release+0x3d/0x60 [nfs]
[<ffffffff81226e6c>] __fput+0xdc/0x1e0
[<ffffffff81226fbe>] ____fput+0xe/0x10
[<ffffffff810bf2e4>] task_work_run+0xc4/0xe0
[<ffffffff810a4188>] do_exit+0x2e8/0xb30
[<ffffffff8102471c>] ? do_audit_syscall_entry+0x6c/0x70
[<ffffffff811464e6>] ? __audit_syscall_exit+0x1e6/0x280
[<ffffffff810a4a5f>] do_group_exit+0x3f/0xa0
[<ffffffff810a4ad4>] SyS_exit_group+0x14/0x20
[<ffffffff8179b76e>] system_call_fastpath+0x12/0x71
Which seems to be due to a call to utsname() when in a task exit context
in order to determine the hostname to set in rpc_new_client().
In reality, what we want here is not the hostname of the current task, but
the hostname that was used to set up the metadata server.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Linus Torvalds [Thu, 30 Jun 2016 16:57:52 +0000 (09:57 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"ARM and x86 fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: nVMX: VMX instructions: fix segment checks when L1 is in long mode.
KVM: LAPIC: cap __delay at lapic_timer_advance_ns
KVM: x86: move nsec_to_cycles from x86.c to x86.h
pvclock: Get rid of __pvclock_read_cycles in function pvclock_read_flags
pvclock: Cleanup to remove function pvclock_get_nsec_offset
pvclock: Add CPU barriers to get correct version value
KVM: arm/arm64: Stop leaking vcpu pid references
arm64: KVM: fix build with CONFIG_ARM_PMU disabled
Linus Torvalds [Thu, 30 Jun 2016 16:53:43 +0000 (09:53 -0700)]
Merge tag 'arc-4.7-rc6-fixes' of git://git./linux/kernel/git/vgupta/arc
Pull ARC fix from Vineet Gupta:
"Reinstate dwarf unwinder/loadable-modules with new gnu tools"
* tag 'arc-4.7-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
arc: unwind: warn only once if DW2_UNWIND is disabled
ARC: unwind: ensure that .debug_frame is generated (vs. .eh_frame)
Linus Torvalds [Thu, 30 Jun 2016 16:49:26 +0000 (09:49 -0700)]
Merge tag 'pwm/for-4.7-rc6' of git://git./linux/kernel/git/thierry.reding/linux-pwm
Pull pwm fixes from Thierry Reding:
"One more fix for some fallout observed after the introduction of the
atomic API"
* tag 'pwm/for-4.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
pwm: Fix pwm_apply_args()
Linus Torvalds [Thu, 30 Jun 2016 16:44:34 +0000 (09:44 -0700)]
Merge tag 'mfd-fixes-4.7' of git://git./linux/kernel/git/lee/mfd
Pull MFD fixes from Lee Jones:
"Contained are some standard fixes and unusually an extension to the
Reset API. Some of those changes are required to fix a bug introduced
in -rc1, which introduces extra 'reset line checks' i.e. whether the
line is shared or not. If a line is shared and the new *_shared() API
is not used, the request fails with an error. This breaks USB in v4.7
for ST's platforms.
Admittedly, there are some patches contained in our (MFD/Reset)
immutable branch which are not true -fixes, but there isn't anything I
can do about that. Rest assured though, there aren't any API
'changes'. Everything is the same from the consumer's perspective.
- Use new reset_*_get_shared() variant to prevent reset line
obtainment failure (Fixes commit 0b52297f2288: "reset: Add support
for shared reset controls")
- Fix unintentional switch() fall-through into error path
- Fix uninitialised variable compiler warning"
* tag 'mfd-fixes-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
mfd: da9053: Fix compiler warning message for uninitialised variable
mfd: max77620: Fix FPS switch statements
phy: phy-stih407-usb: Inform the reset framework that our reset line may be shared
usb: dwc3: st: Inform the reset framework that our reset line may be shared
usb: host: ehci-st: Inform the reset framework that our reset line may be shared
usb: host: ohci-st: Inform the reset framework that our reset line may be shared
reset: TRIVIAL: Add line break at same place for similar APIs
reset: Supply *_shared variant calls when using *_optional APIs
reset: Supply *_shared variant calls when using of_* API
reset: Ensure drivers are explicit when requesting reset lines
reset: Reorder inline reset_control_get*() wrappers
Paolo Bonzini [Thu, 30 Jun 2016 15:11:20 +0000 (17:11 +0200)]
Merge tag 'kvm-arm-for-v4.7-rc6' of git://git./linux/kernel/git/kvmarm/kvmarm into kvm-master
KVM/ARM Fixes for v4.7-rc6:
Fixes a build issue without CONFIG_ARM_PMU and plugs pid leak on arm/arm64.
Steve Twiss [Mon, 27 Jun 2016 15:06:36 +0000 (16:06 +0100)]
mfd: da9053: Fix compiler warning message for uninitialised variable
Fix compiler warning caused by an uninitialised variable inside
da9052_group_write() function. Defaulting the value to zero covers
the trivial case.
Signed-off-by: Steve Twiss <stwiss.opensource@diasemi.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Rhyland Klein [Thu, 12 May 2016 17:45:04 +0000 (13:45 -0400)]
mfd: max77620: Fix FPS switch statements
When configuring FPS during probe, assuming a DT node is present for
FPS, the code can run into a problem with the switch statements in
max77620_config_fps() and max77620_get_fps_period_reg_value(). Namely,
in the case of chip->chip_id == MAX77620, it will set
fps_[min|max]_period but then fall through to the default switch case
and return -EINVAL. Returning this from max77620_config_fps() will
cause probe to fail.
Signed-off-by: Rhyland Klein <rklein@nvidia.com>
Reviewed-by: Laxman Dewangan <ldewangan@nvidia.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Tested-by: Thierry Reding <treding@nvidia.com>
Tested-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
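To make the fall-through failure mode described above concrete, here is a minimal, hypothetical sketch in plain C. It is not the max77620 driver code: the chip IDs, the function name get_fps_periods() and the period values are invented for illustration. The point is only that each supported case ends in an explicit break, so setting the periods can no longer drop into the default error path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical chip identifiers; the real driver's IDs differ. */
enum chip_id { CHIP_A, CHIP_B };

/*
 * Fill in the minimum/maximum FPS periods for a chip. Every supported
 * case ends with "break"; omitting it would fall into default and
 * return -EINVAL even though the periods were already set.
 * The numbers below are illustrative only.
 */
static int get_fps_periods(enum chip_id id, int *min_period, int *max_period)
{
	assert(min_period != NULL && max_period != NULL);

	switch (id) {
	case CHIP_A:
		*min_period = 40;
		*max_period = 2560;
		break;	/* the missing statement in the buggy pattern */
	case CHIP_B:
		*min_period = 20;
		*max_period = 5120;
		break;
	default:
		return -EINVAL;	/* genuinely unsupported chip */
	}
	return 0;
}
```

With the break in place, a supported chip returns 0 and probe can continue; only an unknown ID reaches -EINVAL.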
Lee Jones [Tue, 28 Jun 2016 08:32:12 +0000 (09:32 +0100)]
phy: phy-stih407-usb: Inform the reset framework that our reset line may be shared
On the STiH410 B2120 development board the ports on the Generic PHY
share their reset lines with each other. New functionality in the
reset subsystems forces consumers to be explicit when requesting
shared/exclusive reset lines.
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Lee Jones [Tue, 28 Jun 2016 08:23:58 +0000 (09:23 +0100)]
usb: dwc3: st: Inform the reset framework that our reset line may be shared
On the STiH410 B2120 development board the MiPHY28lp shares its reset
line with the Synopsys DWC3 SuperSpeed (SS) USB 3.0 Dual-Role-Device
(DRD). New functionality in the reset subsystems forces consumers to
be explicit when requesting shared/exclusive reset lines.
Acked-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Lee Jones [Mon, 6 Jun 2016 17:08:53 +0000 (18:08 +0100)]
usb: host: ehci-st: Inform the reset framework that our reset line may be shared
On the STiH410 B2120 development board the ST EHCI IP shares its reset
line with the OHCI IP. New functionality in the reset subsystems forces
consumers to be explicit when requesting shared/exclusive reset lines.
Acked-by: Peter Griffin <peter.griffin@linaro.org>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Lee Jones [Mon, 6 Jun 2016 17:08:54 +0000 (18:08 +0100)]
usb: host: ohci-st: Inform the reset framework that our reset line may be shared
On the STiH410 B2120 development board the ST EHCI IP shares its reset
line with the OHCI IP. New functionality in the reset subsystems forces
consumers to be explicit when requesting shared/exclusive reset lines.
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Linus Torvalds [Wed, 29 Jun 2016 22:30:26 +0000 (15:30 -0700)]
Merge tag 'nfs-for-4.7-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"Stable bugfixes:
- Fix _cancel_empty_pagelist
- Fix a double page unlock
- Make nfs_atomic_open() call d_drop() on all ->open_context() errors.
- Fix another OPEN_DOWNGRADE bug
Other bugfixes:
- Ensure we handle delegation errors in nfs4_proc_layoutget()
- Layout stateids start out as being invalid
- Add sparse lock annotations for pnfs_find_alloc_layout
- Handle bad delegation stateids in nfs4_layoutget_handle_exception
- Fix up O_DIRECT results
- Fix potential use after free of state in nfs4_do_reclaim.
- Mark the layout stateid invalid when all segments are removed
- Don't let readdirplus revalidate an inode that was marked as stale
- Fix potential race in nfs_fhget()
- Fix an unused variable warning"
* tag 'nfs-for-4.7-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Fix another OPEN_DOWNGRADE bug
make nfs_atomic_open() call d_drop() on all ->open_context() errors.
NFS: Fix an unused variable warning
NFS: Fix potential race in nfs_fhget()
NFS: Don't let readdirplus revalidate an inode that was marked as stale
NFSv4.1/pnfs: Mark the layout stateid invalid when all segments are removed
NFS: Fix a double page unlock
pnfs_nfs: fix _cancel_empty_pagelist
nfs4: Fix potential use after free of state in nfs4_do_reclaim.
NFS: Fix up O_DIRECT results
NFS/pnfs: handle bad delegation stateids in nfs4_layoutget_handle_exception
NFSv4.1/pnfs: Add sparse lock annotations for pnfs_find_alloc_layout
NFSv4.1/pnfs: Layout stateids start out as being invalid
NFSv4.1/pnfs: Ensure we handle delegation errors in nfs4_proc_layoutget()
Linus Torvalds [Wed, 29 Jun 2016 22:18:47 +0000 (15:18 -0700)]
Merge branch 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
"Two small patches to fix audit problems in 4.7-rcX: the first fixes a
potential kref leak, the second removes some header file noise.
The first is an important bug fix that really should go in before 4.7
is released; the second is not critical, but falls into the
very-nice-to-have category, so I'm including it in the pull request.
Both patches are straightforward, self-contained, and pass our
testsuite without problem"
* 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit:
audit: move audit_get_tty to reduce scope and kabi changes
audit: move calcs after alloc and check when logging set loginuid
Lee Jones [Mon, 6 Jun 2016 15:56:53 +0000 (16:56 +0100)]
reset: TRIVIAL: Add line break at same place for similar APIs
Standardise the way inline functions:
devm_reset_control_get_shared_by_index
devm_reset_control_get_exclusive_by_index
... are formatted.
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Lee Jones [Mon, 6 Jun 2016 15:56:52 +0000 (16:56 +0100)]
reset: Supply *_shared variant calls when using *_optional APIs
Consumers need to be able to specify whether they are requesting an
'exclusive' or 'shared' reset line no matter which API (of_*, devm_*,
etc) they are using. This change allows users of the optional_* API
in particular to specify that their request is for a 'shared' line.
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Lee Jones [Mon, 6 Jun 2016 15:56:51 +0000 (16:56 +0100)]
reset: Supply *_shared variant calls when using of_* API
Consumers need to be able to specify whether they are requesting an
'exclusive' or 'shared' reset line no matter which API (of_*, devm_*,
etc) they are using. This change allows users of the of_* API in
particular to specify that their request is for a 'shared' line.
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Lee Jones [Mon, 6 Jun 2016 15:56:50 +0000 (16:56 +0100)]
reset: Ensure drivers are explicit when requesting reset lines
Phasing out generic reset line requests enables us to make some better
decisions on when and how to (de)assert said lines. If an 'exclusive'
line is requested, we know a device *requires* a reset and that it's
preferable to act upon a request right away. However, if a 'shared'
reset line is requested, we can reasonably be sure that placing a
device into reset isn't a hard requirement, but probably a measure to
save power and is thus able to cope with not being asserted if another
device is still in use.
In order to allow gentle adoption and not force all consumers to
move to the API immediately, causing administrative headaches between
subsystems, this patch adds some temporary stand-in shim-calls. This
will ease the burden at merge time and allow subsystems to migrate over
to the new API in a more realistic time-frame.
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
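The shared-line behaviour described above (a shared line may stay deasserted while another consumer still uses it) can be modelled with a per-line deassert counter. The following is a simplified, hypothetical user-space sketch under that assumption; struct reset_line, line_deassert() and line_assert() are invented names and are not the kernel's struct reset_control or reset_control_assert()/reset_control_deassert().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one reset line; "asserted" stands in for hardware state. */
struct reset_line {
	bool shared;         /* requested through a *_shared() style getter */
	int deassert_count;  /* consumers currently needing the device out of reset */
	bool asserted;       /* true while the device is held in reset */
};

/* Take the device out of reset; a shared line only changes state for its first user. */
static int line_deassert(struct reset_line *rl)
{
	assert(rl != NULL);
	if (rl->shared && ++rl->deassert_count != 1)
		return 0;	/* already deasserted on behalf of another consumer */
	rl->asserted = false;
	return 0;
}

/* Put the device into reset; a shared line is only asserted when its last user asks. */
static int line_assert(struct reset_line *rl)
{
	assert(rl != NULL);
	if (rl->shared) {
		if (rl->deassert_count == 0)
			return -EINVAL;	/* unbalanced assert */
		if (--rl->deassert_count != 0)
			return 0;	/* another consumer is still in use */
	}
	rl->asserted = true;
	return 0;
}
```

An exclusive line, by contrast, acts on every request immediately, which corresponds to the "device requires a reset" case in the commit message.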
Lee Jones [Mon, 6 Jun 2016 15:56:49 +0000 (16:56 +0100)]
reset: Reorder inline reset_control_get*() wrappers
We're about to split the current API into two, where consumers will
be forced to be explicit when requesting reset lines. The choice
will be to call either the *_exclusive or the *_shared variant,
depending on whether they can actually tolerate not being asserted
when that request is made.
The new API will look like this once reordered and complete:
reset_control_get_exclusive()
reset_control_get_shared()
reset_control_get_optional_exclusive()
reset_control_get_optional_shared()
of_reset_control_get_exclusive()
of_reset_control_get_shared()
of_reset_control_get_exclusive_by_index()
of_reset_control_get_shared_by_index()
devm_reset_control_get_exclusive()
devm_reset_control_get_shared()
devm_reset_control_get_optional_exclusive()
devm_reset_control_get_optional_shared()
devm_reset_control_get_exclusive_by_index()
devm_reset_control_get_shared_by_index()
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Linus Torvalds [Wed, 29 Jun 2016 18:50:42 +0000 (11:50 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
"I've been traveling so this accumulates more than a week or so of bug
fixing. It perhaps looks a little worse than it really is.
1) Fix deadlock in ath10k driver, from Ben Greear.
2) Increase scan timeout in iwlwifi, from Luca Coelho.
3) Unbreak STP by properly reinjecting STP packets back into the
stack. Regression fix from Ido Schimmel.
4) Mediatek driver fixes (missing malloc failure checks, leaking of
scratch memory, wrong indexing when mapping TX buffers, etc.) from
John Crispin.
5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
Sowa.
6) Fix hashing of flows in UDP in the reuseport case, from Xuemin Su.
7) Fix netlink notifications in ovs for tunnels, delete link messages
are never emitted because of how the device registry state is
handled. From Nicolas Dichtel.
8) Conntrack module leaks kmemcache on unload, from Florian Westphal.
9) Prevent endless jump loops in nft rules, from Liping Zhang and
Pablo Neira Ayuso.
10) Spinlock initialization in mlx4 was not done early enough, from Eric
Dumazet.
11) Bind refcount leak in act_ipt, from Cong WANG.
12) Missing RCU locking in HTB scheduler, from Florian Westphal.
13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
barrier, using heap for SG and IV, and erroneous use of async flag
when allocating AEAD context.)
14) RCU handling fix in TIPC, from Ying Xue.
15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
SIT driver, from Simon Horman.
16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.
17) Fix potential deadlock in team enslave, from Ido Schimmel.
18) Memory leak in KCM procfs handling, from Jiri Slaby.
19) ESN generation fix in ipv4 ESP, from Herbert Xu.
20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
WANG.
21) Use after free in netem, from Eric Dumazet.
22) Uninitialized last assert time in multicast router code, from Tom
Goff.
23) Skip raw sockets in sock_diag destruction broadcast, from Willem
de Bruijn.
24) Fix link status reporting in thunderx, from Sunil Goutham.
25) Limit resegmentation of retransmit queue so that we do not
retransmit too large GSO frames. From Eric Dumazet.
26) Delay bpf program release after grace period, from Daniel
Borkmann"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
openvswitch: fix conntrack netlink event delivery
qed: Protect the doorbell BAR with the write barriers.
neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
e1000e: keep VLAN interfaces functional after rxvlan off
cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
bpf, perf: delay release of BPF prog after grace period
net: bridge: fix vlan stats continue counter
tcp: do not send too big packets at retransmit time
ibmvnic: fix to use list_for_each_safe() when delete items
net: thunderx: Fix TL4 configuration for secondary Qsets
net: thunderx: Fix link status reporting
net/mlx5e: Reorganize ethtool statistics
net/mlx5e: Fix number of PFC counters reported to ethtool
net/mlx5e: Prevent adding the same vxlan port
net/mlx5e: Check for BlueFlame capability before allocating SQ uar
net/mlx5e: Change enum to better reflect usage
net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
net/mlx5: Update command strings
net: marvell: Add separate config ANEG function for Marvell 88E1111
...
Linus Torvalds [Wed, 29 Jun 2016 18:48:05 +0000 (11:48 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
"Another two bug fixes for 4.7:
- The revert of a patch which removed boot information for systems
using an intermediate boot kernel, e.g. the SLES12 grub setup.
- A fix for an incorrect inline assembly constraint that causes
broken code to be generated with gcc 4.8.5"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: fix test_fp_ctl inline assembly contraints
Revert "s390/kdump: Clear subchannel ID to signal non-CCW/SCSI IPL"
Linus Torvalds [Wed, 29 Jun 2016 17:05:44 +0000 (10:05 -0700)]
Merge tag 'pinctrl-v4.7-3' of git://git./linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"Here are a bunch of fixes for pin control. Just drivers and a
MAINTAINERS fixup:
- Driver fixes for i.MX, single register, Tegra and BayTrail.
- MAINTAINERS entry for the documentation"
* tag 'pinctrl-v4.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: baytrail: Fix mingled clock pins
MAINTAINERS: belong Documentation/pinctrl.txt properly
pinctrl: tegra: Fix build dependency
gpio: tegra: Make lockdep class file-scoped
pinctrl: single: Fix missing flush of posted write for a wakeirq
pinctrl: imx: Do not treat a PIN without MUX register as an error
Linus Torvalds [Wed, 29 Jun 2016 17:04:42 +0000 (10:04 -0700)]
Merge branch 'for-4.7-fixes' of git://git./linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Three fix patches. Two are for cgroup / css init failure path. The
last one makes css_set_lock irq-safe as the deadline scheduler ends up
calling put_css_set() from irq context"
* 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Disable IRQs while holding css_set_lock
cgroup: set css->id to -1 during init
cgroup: remove redundant cleanup in css_create
David S. Miller [Wed, 29 Jun 2016 12:33:46 +0000 (08:33 -0400)]
Merge tag 'mac80211-for-davem-2016-06-29-v2' of git://git./linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Just two small fixes
* fix mesh peer link counter, decrement wasn't always done at all
* fix ethertype (length) for packets without RFC 1042 or bridge
tunnel header
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Samuel Gauthier [Tue, 28 Jun 2016 15:22:26 +0000 (17:22 +0200)]
openvswitch: fix conntrack netlink event delivery
Only the first and last netlink message for a particular conntrack are
actually sent. The first message is sent through nf_conntrack_confirm when
the conntrack is committed. The last one is sent when the conntrack is
destroyed on timeout. The other conntrack state change messages are not
advertised.
When the conntrack subsystem is used from netfilter, nf_conntrack_confirm
is called for each packet, from the postrouting hook, which in turn calls
nf_ct_deliver_cached_events to send the state change netlink messages.
This commit fixes the problem by calling nf_ct_deliver_cached_events in the
non-commit case as well.
Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
CC: Joe Stringer <joestringer@nicira.com>
CC: Justin Pettit <jpettit@nicira.com>
CC: Andy Zhou <azhou@nicira.com>
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Samuel Gauthier <samuel.gauthier@6wind.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudarsana Reddy Kalluru [Tue, 28 Jun 2016 11:46:03 +0000 (07:46 -0400)]
qed: Protect the doorbell BAR with the write barriers.
SPQ doorbell is currently protected with a compiler barrier. Under
stress scenarios, we may get into a state where (due to the weak ordering)
several ramrod doorbells were written to the BAR with out-of-order
producer values. The barrier type needs to be changed to a write barrier to
make sure that the write buffer is flushed after each doorbell.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Barroso [Tue, 28 Jun 2016 08:16:43 +0000 (11:16 +0300)]
neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
neigh_xmit() expects to be called inside an RCU-bh read side critical
section, and while one of its two current callers gets this right, the
other one doesn't.
More specifically, neigh_xmit() has two callers, mpls_forward() and
mpls_output(), and while both callers call neigh_xmit() under
rcu_read_lock(), this provides sufficient protection for neigh_xmit()
only in the case of mpls_forward(), as that is always called from
softirq context and therefore doesn't need explicit BH protection,
while mpls_output() can be called from process context with softirqs
enabled.
When mpls_output() is called from process context, with softirqs
enabled, we can be preempted by a softirq at any time, and RCU-bh
considers the completion of a softirq as signaling the end of any
pending read-side critical sections, so if we do get a softirq
while we are in the part of neigh_xmit() that expects to be run inside
an RCU-bh read side critical section, we can end up with an unexpected
RCU grace period running right in the middle of that critical section,
making things go boom.
This patch fixes this impedance mismatch in the callee, by making
neigh_xmit() always take rcu_read_{,un}lock_bh() around the code that
expects to be treated as an RCU-bh read side critical section, as this
seems a safer option than fixing it in the callers.
Fixes: 4fd3d7d9e868f ("neigh: Add helper function neigh_xmit")
Signed-off-by: David Barroso <dbarroso@fastly.com>
Signed-off-by: Lennert Buytenhek <lbuytenhek@fastly.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jarod Wilson [Wed, 29 Jun 2016 03:41:31 +0000 (20:41 -0700)]
e1000e: keep VLAN interfaces functional after rxvlan off
I've got a bug report about an e1000e interface, where a VLAN interface is
set up on top of it:
$ ip link add link ens1f0 name ens1f0.99 type vlan id 99
$ ip link set ens1f0 up
$ ip link set ens1f0.99 up
$ ip addr add 192.168.99.92 dev ens1f0.99
At this point, I can ping another host on vlan 99, ip 192.168.99.91.
However, if I do the following:
$ ethtool -K ens1f0 rxvlan off
Then no traffic passes on ens1f0.99. It comes back if I toggle rxvlan on
again. I'm not sure if this is actually intended behavior, or if there's a
lack of software VLAN stripping fallback, or what, but things continue to
work if I simply don't call e1000e_vlan_strip_disable() if there are
active VLANs (plagiarizing a function from the e1000 driver here) on the
interface.
Also slipped a related-ish fix to the kerneldoc text for
e1000e_vlan_strip_disable here...
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
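The workaround described in the e1000e commit amounts to a guard: honour "rxvlan off" only while no VLAN interface is configured on top of the port. A minimal, hypothetical user-space sketch of that logic, based only on the commit message; struct adapter_model, vlan_add(), vlan_used() and request_strip_disable() are invented names, not the driver's.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define N_VID 4096	/* 12-bit VLAN ID space */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define VID_WORDS ((N_VID + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical adapter state: one bit per configured VLAN ID. */
struct adapter_model {
	unsigned long active_vlans[VID_WORDS];
	bool hw_vlan_strip;	/* hardware VLAN tag stripping enabled */
};

/* Record that a VLAN interface with this ID exists on the port. */
static void vlan_add(struct adapter_model *a, unsigned int vid)
{
	assert(vid < N_VID);
	a->active_vlans[vid / BITS_PER_WORD] |= 1UL << (vid % BITS_PER_WORD);
}

/* True if any VLAN is configured on top of the interface. */
static bool vlan_used(const struct adapter_model *a)
{
	for (size_t i = 0; i < VID_WORDS; i++)
		if (a->active_vlans[i])
			return true;
	return false;
}

/* Disable hardware stripping only when no VLAN interface depends on it. */
static void request_strip_disable(struct adapter_model *a)
{
	if (!vlan_used(a))
		a->hw_vlan_strip = false;
}
```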
Felix Fietkau [Wed, 29 Jun 2016 08:36:39 +0000 (10:36 +0200)]
cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
The PDU length of incoming LLC frames is set to the total skb payload size
in __ieee80211_data_to_8023() of net/wireless/util.c which incorrectly
includes the length of the IEEE 802.11 header.
The resulting LLC frame header has a too large PDU length, causing the
llc_fixup_skb() function of net/llc/llc_input.c to reject the incoming
skb, effectively breaking STP.
Solve the problem by properly subtracting the IEEE 802.11 frame header size
from the PDU length, allowing the LLC processor to pick up the incoming
control messages.
Special thanks to Gerry Rozema for tracking down the regression and proposing
a suitable patch.
Fixes: 2d1c304cb2d5 ("cfg80211: add function for 802.3 conversion with separate output buffer")
Cc: stable@vger.kernel.org
Reported-by: Gerry Rozema <gerryr@rozeware.com>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
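The cfg80211 fix is a length calculation: the 802.3 length field of an LLC frame should cover only the data that follows the 802.11 header, not the header itself. A hypothetical one-function sketch of that arithmetic (llc_pdu_length() is an invented name, not the cfg80211 code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Length to advertise for the LLC PDU: total received frame length
 * minus the 802.11 header that is stripped during conversion.
 */
static uint16_t llc_pdu_length(size_t frame_len, size_t hdrlen)
{
	assert(frame_len >= hdrlen);
	return (uint16_t)(frame_len - hdrlen);
}
```

For example, with a 24-byte three-address data header, a 63-byte frame carries a 39-byte PDU; reporting 63 instead is the "too large PDU length" that makes llc_fixup_skb() drop the frame.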
Dan Carpenter [Mon, 27 Jun 2016 20:50:29 +0000 (23:50 +0300)]
qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
There is a static checker warning here "warn: mask and shift to zero"
and the code sets "ring" to zero every time. From looking at how
QLCNIC_FETCH_RING_ID() is used in qlcnic_83xx_process_rcv_ring(), the
qlcnic_83xx_hndl() call should be removed.
Fixes: 4be41e92f7c6 ('qlcnic: 83xx data path routines')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
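For readers unfamiliar with the checker warning, this hypothetical sketch shows the general "mask and shift to zero" pattern: a value is first masked down to its low bits and then shifted right past all of them, so the result is always 0. GET_HANDLE() and FETCH_RING_ID() are invented macros for illustration and are not the qlcnic definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field layout: handle in bits 0..15, ring id in bits 48..49. */
#define GET_HANDLE(sts)  ((sts) & 0xFFFFULL)
#define FETCH_RING_ID(x) (((x) >> 48) & 0x3ULL)

/* Buggy: the handle mask already cleared bits 48..49, so this is always 0. */
static unsigned int ring_buggy(uint64_t sts)
{
	return (unsigned int)FETCH_RING_ID(GET_HANDLE(sts));
}

/* Fixed: extract the ring id from the raw status word. */
static unsigned int ring_fixed(uint64_t sts)
{
	return (unsigned int)FETCH_RING_ID(sts);
}
```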
Daniel Borkmann [Mon, 27 Jun 2016 19:38:11 +0000 (21:38 +0200)]
bpf, perf: delay release of BPF prog after grace period
Commit dead9f29ddcc ("perf: Fix race in BPF program unregister") moved
destruction of the BPF program from the free_event_rcu() callback to __free_event(),
which is problematic if used with tail calls: if prog A is attached as a
trace event directly, but at the same time present in a tail call map used
by another trace event program elsewhere, then we need to delay destruction
via an RCU grace period since it can still be in use by the program doing the
tail call (the prog first needs to be dropped from the tail call map, and only
then the trace event with prog A attached destroyed, for us to get immediate
destruction).
Fixes: dead9f29ddcc ("perf: Fix race in BPF program unregister")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jann Horn <jann@thejh.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Mon, 27 Jun 2016 16:34:42 +0000 (18:34 +0200)]
net: bridge: fix vlan stats continue counter
I made a dumb off-by-one mistake when I added the vlan stats counter
dumping code. The increment should happen before the check, not after;
otherwise we miss one entry when we continue dumping.
Fixes: a60c090361ea ("bridge: netlink: export per-vlan stats")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
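Resumable netlink-style dumps depend on the saved position counting exactly the entries already emitted; an off-by-one there silently skips an entry at each chunk boundary. A minimal, hypothetical user-space sketch of a dump loop that keeps the counter consistent (dump_chunk() is an invented name; this is not the bridge code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Emit at most 'room' entries into 'out', starting at *resume, and store
 * the index of the first entry not yet emitted back into *resume.
 * Returns the number of entries emitted (0 when the dump is complete).
 */
static size_t dump_chunk(const int *entries, size_t n, size_t *resume,
			 size_t room, int *out)
{
	size_t emitted = 0;
	size_t idx;

	for (idx = 0; idx < n; idx++) {
		if (idx < *resume)
			continue;	/* already sent in an earlier chunk */
		if (emitted == room)
			break;		/* buffer full: next call resumes at idx */
		out[emitted++] = entries[idx];
	}
	*resume = idx;
	return emitted;
}
```

Calling it repeatedly until it returns 0 yields every entry exactly once, in order, regardless of the chunk size.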
Eric Dumazet [Mon, 27 Jun 2016 15:38:50 +0000 (17:38 +0200)]
tcp: do not send too big packets at retransmit time
Arjun reported a bug in TCP stack and bisected it to a recent commit.
In case where we process SACK, we can coalesce multiple skbs
into fat ones (tcp_shift_skb_data()), to lower write queue
overhead, because we do not expect to retransmit these packets.
However, SACK reneging can happen, forcing the sender to retransmit
all these packets. If skb->len is above 64KB, we then send buggy
IP packets that could hang TSO engine on cxgb4.
Neal suggested to use tcp_tso_autosize() instead of tp->gso_segs
so that we cook packets of optimal size vs TCP/pacing.
Thanks to Arjun for reporting the bug and running the tests!
Fixes: 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Arjun V <arjun@chelsio.com>
Tested-by: Arjun V <arjun@chelsio.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
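The idea of sizing retransmitted packets from the pacing rate, rather than reusing the length of a coalesced skb, can be sketched as follows. This is a simplified, hypothetical model loosely based on our reading of tcp_tso_autosize(); the function name, parameter names, the roughly 1 ms budget (rate >> 10) and the omission of header overhead are assumptions, so the real kernel formula may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of MSS-sized segments to put in one TSO packet: about 1 ms
 * worth of data at the pacing rate, never reaching gso_max_size
 * (e.g. 64KB), at least min_segs and at most gso_max_segs.
 */
static uint32_t tso_autosize_segs(uint64_t pacing_rate_bytes_per_sec,
				  uint32_t mss, uint32_t gso_max_size,
				  uint32_t min_segs, uint32_t gso_max_segs)
{
	uint64_t bytes = pacing_rate_bytes_per_sec >> 10;	/* ~1 ms of data */
	uint32_t segs;

	assert(mss > 0 && gso_max_size > 0);
	if (bytes > gso_max_size - 1)
		bytes = gso_max_size - 1;	/* stay below the GSO size limit */
	segs = (uint32_t)(bytes / mss);
	if (segs < min_segs)
		segs = min_segs;
	return segs < gso_max_segs ? segs : gso_max_segs;
}
```

With a very high pacing rate and a 1448-byte MSS, the result is 45 segments (65160 bytes), below 64KB; a slow flow falls back to min_segs.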
Wei Yongjun [Mon, 27 Jun 2016 12:48:53 +0000 (20:48 +0800)]
ibmvnic: fix to use list_for_each_safe() when delete items
Since we will remove items off the list using list_del() we need
to use a safe version of the list_for_each() macro aptly named
list_for_each_safe().
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
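Why the plain iterator is unsafe here: list_for_each() reads pos->next after the loop body runs, so deleting and freeing pos inside the body is a use after free. list_for_each_safe() caches the next pointer before the body runs. A minimal user-space re-creation of the intrusive list to illustrate this (simplified: the kernel's list_del() poisons the pointers rather than setting them to NULL, and struct item and free_all() are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal user-space re-creation of the kernel's intrusive list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = e->prev = NULL;	/* the kernel uses poison values instead */
}

/* 'n' caches pos->next before the body runs, so the body may free 'pos'. */
#define list_for_each_safe(pos, n, head) \
	for (pos = (head)->next, n = pos->next; pos != (head); \
	     pos = n, n = pos->next)

struct item { struct list_head node; int val; };	/* node is the first member */

/* Remove and free every item; returns how many were freed. */
static int free_all(struct list_head *head)
{
	struct list_head *pos, *n;
	int freed = 0;

	list_for_each_safe(pos, n, head) {
		list_del(pos);
		free((struct item *)pos);	/* valid cast: node is at offset 0 */
		freed++;
	}
	return freed;
}
```

With plain list_for_each() in free_all(), the loop increment would dereference the freed (and NULLed) entry; the safe variant never touches pos after the body.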