Pull io_uring IO interface from Jens Axboe:
"Second attempt at adding the io_uring interface.
Since the first one, we've added basic unit testing of the three
system calls, that resides in liburing like the other unit tests that
we have so far. It'll take a while to get full coverage of it, but
we're working towards it. I've also added two basic test programs to
tools/io_uring. One uses the raw interface and has support for all the
various features that io_uring supports outside of standard IO, like
fixed files, fixed IO buffers, and polled IO. The other uses the
liburing API, and is a simplified version of cp(1).
This adds support for a new IO interface, io_uring.
io_uring allows an application to communicate with the kernel through
two rings, the submission queue (SQ) and completion queue (CQ) ring.
This allows for very efficient handling of IOs, see the v5 posting for
some basic numbers:
https://lore.kernel.org/linux-block/
20190116175003.17880-1-axboe@kernel.dk/
Outside of just efficiency, the interface is also flexible and
extendable, and allows for future use cases like the upcoming NVMe
key-value store API, networked IO, and so on. It also supports async
buffered IO, something that we've always failed to support in the
kernel.
Outside of basic IO features, it supports async polled IO as well.
This particular feature has already been tested at Facebook months ago
for flash storage boxes, with 25-33% improvements. It makes polled IO
actually useful for real world use cases, where even basic flash sees
a nice win in terms of efficiency, latency, and performance. These
boxes were IOPS bound before, now they are not.
This series adds three new system calls. One for setting up an
io_uring instance (io_uring_setup(2)), one for submitting/completing
IO (io_uring_enter(2)), and one for aux functions like registrating
file sets, buffers, etc (io_uring_register(2)). Through the help of
Arnd, I've coordinated the syscall numbers so merge on that front
should be painless.
Jon did a writeup of the interface a while back, which (except for
minor details that have been tweaked) is still accurate. Find that
here:
https://lwn.net/Articles/776703/
Huge thanks to Al Viro for helping getting the reference cycle code
correct, and to Jann Horn for his extensive reviews focused on both
security and bugs in general.
There's a userspace library that provides basic functionality for
applications that don't need or want to care about how to fiddle with
the rings directly. It has helpers to allow applications to easily set
up an io_uring instance, and submit/complete IO through it without
knowing about the intricacies of the rings. It also includes man pages
(thanks to Jeff Moyer), and will continue to grow support helper
functions and features as time progresses. Find it here:
git://git.kernel.dk/liburing
Fio has full support for the raw interface, both in the form of an IO
engine (io_uring), but also with a small test application (t/io_uring)
that can exercise and benchmark the interface"
* tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
io_uring: add a few test tools
io_uring: allow workqueue item to handle multiple buffered requests
io_uring: add support for IORING_OP_POLL
io_uring: add io_kiocb ref count
io_uring: add submission polling
io_uring: add file set registration
net: split out functions related to registering inflight socket files
io_uring: add support for pre-mapped user IO buffers
block: implement bio helper to add iter bvec pages to bio
io_uring: batch io_kiocb allocation
io_uring: use fget/fput_many() for file references
fs: add fget_many() and fput_many()
io_uring: support for IO polling
io_uring: add fsync support
Add io_uring IO interface
382 i386 pkey_free sys_pkey_free __ia32_sys_pkey_free
383 i386 statx sys_statx __ia32_sys_statx
384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
-385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
+385 i386 io_pgetevents sys_io_pgetevents_time32 __ia32_compat_sys_io_pgetevents
386 i386 rseq sys_rseq __ia32_sys_rseq
+# don't use numbers 387 through 392, add new calls at the end
+393 i386 semget sys_semget __ia32_sys_semget
+394 i386 semctl sys_semctl __ia32_compat_sys_semctl
+395 i386 shmget sys_shmget __ia32_sys_shmget
+396 i386 shmctl sys_shmctl __ia32_compat_sys_shmctl
+397 i386 shmat sys_shmat __ia32_compat_sys_shmat
+398 i386 shmdt sys_shmdt __ia32_sys_shmdt
+399 i386 msgget sys_msgget __ia32_sys_msgget
+400 i386 msgsnd sys_msgsnd __ia32_compat_sys_msgsnd
+401 i386 msgrcv sys_msgrcv __ia32_compat_sys_msgrcv
+402 i386 msgctl sys_msgctl __ia32_compat_sys_msgctl
+403 i386 clock_gettime64 sys_clock_gettime __ia32_sys_clock_gettime
+404 i386 clock_settime64 sys_clock_settime __ia32_sys_clock_settime
+405 i386 clock_adjtime64 sys_clock_adjtime __ia32_sys_clock_adjtime
+406 i386 clock_getres_time64 sys_clock_getres __ia32_sys_clock_getres
+407 i386 clock_nanosleep_time64 sys_clock_nanosleep __ia32_sys_clock_nanosleep
+408 i386 timer_gettime64 sys_timer_gettime __ia32_sys_timer_gettime
+409 i386 timer_settime64 sys_timer_settime __ia32_sys_timer_settime
+410 i386 timerfd_gettime64 sys_timerfd_gettime __ia32_sys_timerfd_gettime
+411 i386 timerfd_settime64 sys_timerfd_settime __ia32_sys_timerfd_settime
+412 i386 utimensat_time64 sys_utimensat __ia32_sys_utimensat
+413 i386 pselect6_time64 sys_pselect6 __ia32_compat_sys_pselect6_time64
+414 i386 ppoll_time64 sys_ppoll __ia32_compat_sys_ppoll_time64
+416 i386 io_pgetevents_time64 sys_io_pgetevents __ia32_sys_io_pgetevents
+417 i386 recvmmsg_time64 sys_recvmmsg __ia32_compat_sys_recvmmsg_time64
+418 i386 mq_timedsend_time64 sys_mq_timedsend __ia32_sys_mq_timedsend
+419 i386 mq_timedreceive_time64 sys_mq_timedreceive __ia32_sys_mq_timedreceive
+420 i386 semtimedop_time64 sys_semtimedop __ia32_sys_semtimedop
+421 i386 rt_sigtimedwait_time64 sys_rt_sigtimedwait __ia32_compat_sys_rt_sigtimedwait_time64
+422 i386 futex_time64 sys_futex __ia32_sys_futex
+423 i386 sched_rr_get_interval_time64 sys_sched_rr_get_interval __ia32_sys_sched_rr_get_interval
+ 425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup
+ 426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter
+ 427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register
332 common statx __x64_sys_statx
333 common io_pgetevents __x64_sys_io_pgetevents
334 common rseq __x64_sys_rseq
+# don't use numbers 387 through 423, add new calls after the last
+# 'common' entry
+ 425 common io_uring_setup __x64_sys_io_uring_setup
+ 426 common io_uring_enter __x64_sys_io_uring_enter
+ 427 common io_uring_register __x64_sys_io_uring_register
#
# x32-specific system call numbers start at 512 to avoid cache impact
__SYSCALL(__NR_rseq, sys_rseq)
#define __NR_kexec_file_load 294
__SYSCALL(__NR_kexec_file_load, sys_kexec_file_load)
+/* 295 through 402 are unassigned to sync up with generic numbers, don't use */
+#if __BITS_PER_LONG == 32
+#define __NR_clock_gettime64 403
+__SYSCALL(__NR_clock_gettime64, sys_clock_gettime)
+#define __NR_clock_settime64 404
+__SYSCALL(__NR_clock_settime64, sys_clock_settime)
+#define __NR_clock_adjtime64 405
+__SYSCALL(__NR_clock_adjtime64, sys_clock_adjtime)
+#define __NR_clock_getres_time64 406
+__SYSCALL(__NR_clock_getres_time64, sys_clock_getres)
+#define __NR_clock_nanosleep_time64 407
+__SYSCALL(__NR_clock_nanosleep_time64, sys_clock_nanosleep)
+#define __NR_timer_gettime64 408
+__SYSCALL(__NR_timer_gettime64, sys_timer_gettime)
+#define __NR_timer_settime64 409
+__SYSCALL(__NR_timer_settime64, sys_timer_settime)
+#define __NR_timerfd_gettime64 410
+__SYSCALL(__NR_timerfd_gettime64, sys_timerfd_gettime)
+#define __NR_timerfd_settime64 411
+__SYSCALL(__NR_timerfd_settime64, sys_timerfd_settime)
+#define __NR_utimensat_time64 412
+__SYSCALL(__NR_utimensat_time64, sys_utimensat)
+#define __NR_pselect6_time64 413
+__SC_COMP(__NR_pselect6_time64, sys_pselect6, compat_sys_pselect6_time64)
+#define __NR_ppoll_time64 414
+__SC_COMP(__NR_ppoll_time64, sys_ppoll, compat_sys_ppoll_time64)
+#define __NR_io_pgetevents_time64 416
+__SYSCALL(__NR_io_pgetevents_time64, sys_io_pgetevents)
+#define __NR_recvmmsg_time64 417
+__SC_COMP(__NR_recvmmsg_time64, sys_recvmmsg, compat_sys_recvmmsg_time64)
+#define __NR_mq_timedsend_time64 418
+__SYSCALL(__NR_mq_timedsend_time64, sys_mq_timedsend)
+#define __NR_mq_timedreceive_time64 419
+__SYSCALL(__NR_mq_timedreceive_time64, sys_mq_timedreceive)
+#define __NR_semtimedop_time64 420
+__SYSCALL(__NR_semtimedop_time64, sys_semtimedop)
+#define __NR_rt_sigtimedwait_time64 421
+__SC_COMP(__NR_rt_sigtimedwait_time64, sys_rt_sigtimedwait, compat_sys_rt_sigtimedwait_time64)
+#define __NR_futex_time64 422
+__SYSCALL(__NR_futex_time64, sys_futex)
+#define __NR_sched_rr_get_interval_time64 423
+__SYSCALL(__NR_sched_rr_get_interval_time64, sys_sched_rr_get_interval)
+#endif
+
+ #define __NR_io_uring_setup 425
+ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
+ #define __NR_io_uring_enter 426
+ __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
+ #define __NR_io_uring_register 427
+ __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+
#undef __NR_syscalls
- #define __NR_syscalls 424
+ #define __NR_syscalls 428
/*
* 32 bit systems traditionally used different
COND_SYSCALL(io_submit);
COND_SYSCALL_COMPAT(io_submit);
COND_SYSCALL(io_cancel);
+COND_SYSCALL(io_getevents_time32);
COND_SYSCALL(io_getevents);
+COND_SYSCALL(io_pgetevents_time32);
COND_SYSCALL(io_pgetevents);
-COND_SYSCALL_COMPAT(io_getevents);
+COND_SYSCALL_COMPAT(io_pgetevents_time32);
COND_SYSCALL_COMPAT(io_pgetevents);
+ COND_SYSCALL(io_uring_setup);
+ COND_SYSCALL(io_uring_enter);
+ COND_SYSCALL(io_uring_register);
/* fs/xattr.c */