+.. SPDX-License-Identifier: GPL-2.0
+
+===============================================
Checksum Offloads in the Linux Networking Stack
+===============================================
Introduction
============
-This document describes a set of techniques in the Linux networking stack
- to take advantage of checksum offload capabilities of various NICs.
+This document describes a set of techniques in the Linux networking stack to
+take advantage of checksum offload capabilities of various NICs.
The following technologies are described:
- * TX Checksum Offload
- * LCO: Local Checksum Offload
- * RCO: Remote Checksum Offload
+
+* TX Checksum Offload
+* LCO: Local Checksum Offload
+* RCO: Remote Checksum Offload
Things that should be documented here but aren't yet:
- * RX Checksum Offload
- * CHECKSUM_UNNECESSARY conversion
+
+* RX Checksum Offload
+* CHECKSUM_UNNECESSARY conversion
TX Checksum Offload
===================
-The interface for offloading a transmit checksum to a device is explained
- in detail in comments near the top of include/linux/skbuff.h.
+The interface for offloading a transmit checksum to a device is explained in
+detail in comments near the top of include/linux/skbuff.h.
+
In brief, it allows to request the device fill in a single ones-complement
- checksum defined by the sk_buff fields skb->csum_start and
- skb->csum_offset. The device should compute the 16-bit ones-complement
- checksum (i.e. the 'IP-style' checksum) from csum_start to the end of the
- packet, and fill in the result at (csum_start + csum_offset).
-Because csum_offset cannot be negative, this ensures that the previous
- value of the checksum field is included in the checksum computation, thus
- it can be used to supply any needed corrections to the checksum (such as
- the sum of the pseudo-header for UDP or TCP).
+checksum defined by the sk_buff fields skb->csum_start and skb->csum_offset.
+The device should compute the 16-bit ones-complement checksum (i.e. the
+'IP-style' checksum) from csum_start to the end of the packet, and fill in the
+result at (csum_start + csum_offset).
+
+Because csum_offset cannot be negative, this ensures that the previous value of
+the checksum field is included in the checksum computation, thus it can be used
+to supply any needed corrections to the checksum (such as the sum of the
+pseudo-header for UDP or TCP).
+
This interface only allows a single checksum to be offloaded. Where
- encapsulation is used, the packet may have multiple checksum fields in
- different header layers, and the rest will have to be handled by another
- mechanism such as LCO or RCO.
+encapsulation is used, the packet may have multiple checksum fields in
+different header layers, and the rest will have to be handled by another
+mechanism such as LCO or RCO.
+
CRC32c can also be offloaded using this interface, by means of filling
- skb->csum_start and skb->csum_offset as described above, and setting
- skb->csum_not_inet: see skbuff.h comment (section 'D') for more details.
+skb->csum_start and skb->csum_offset as described above, and setting
+skb->csum_not_inet: see skbuff.h comment (section 'D') for more details.
+
No offloading of the IP header checksum is performed; it is always done in
- software. This is OK because when we build the IP header, we obviously
- have it in cache, so summing it isn't expensive. It's also rather short.
+software. This is OK because when we build the IP header, we obviously have it
+in cache, so summing it isn't expensive. It's also rather short.
+
The requirements for GSO are more complicated, because when segmenting an
- encapsulated packet both the inner and outer checksums may need to be
- edited or recomputed for each resulting segment. See the skbuff.h comment
- (section 'E') for more details.
+encapsulated packet both the inner and outer checksums may need to be edited or
+recomputed for each resulting segment. See the skbuff.h comment (section 'E')
+for more details.
A driver declares its offload capabilities in netdev->hw_features; see
- Documentation/networking/netdev-features.txt for more. Note that a device
- which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start
- and csum_offset given in the SKB; if it tries to deduce these itself in
- hardware (as some NICs do) the driver should check that the values in the
- SKB match those which the hardware will deduce, and if not, fall back to
- checksumming in software instead (with skb_csum_hwoffload_help() or one of
- the skb_checksum_help() / skb_crc32c_csum_help functions, as mentioned in
- include/linux/skbuff.h).
-
-The stack should, for the most part, assume that checksum offload is
- supported by the underlying device. The only place that should check is
- validate_xmit_skb(), and the functions it calls directly or indirectly.
- That function compares the offload features requested by the SKB (which
- may include other offloads besides TX Checksum Offload) and, if they are
- not supported or enabled on the device (determined by netdev->features),
- performs the corresponding offload in software. In the case of TX
- Checksum Offload, that means calling skb_csum_hwoffload_help(skb, features).
+Documentation/networking/netdev-features.txt for more. Note that a device
+which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start and
+csum_offset given in the SKB; if it tries to deduce these itself in hardware
+(as some NICs do) the driver should check that the values in the SKB match
+those which the hardware will deduce, and if not, fall back to checksumming in
+software instead (with skb_csum_hwoffload_help() or one of the
+skb_checksum_help() / skb_crc32c_csum_help functions, as mentioned in
+include/linux/skbuff.h).
+
+The stack should, for the most part, assume that checksum offload is supported
+by the underlying device. The only place that should check is
+validate_xmit_skb(), and the functions it calls directly or indirectly. That
+function compares the offload features requested by the SKB (which may include
+other offloads besides TX Checksum Offload) and, if they are not supported or
+enabled on the device (determined by netdev->features), performs the
+corresponding offload in software. In the case of TX Checksum Offload, that
+means calling skb_csum_hwoffload_help(skb, features).
LCO: Local Checksum Offload
===========================
LCO is a technique for efficiently computing the outer checksum of an
- encapsulated datagram when the inner checksum is due to be offloaded.
-The ones-complement sum of a correctly checksummed TCP or UDP packet is
- equal to the complement of the sum of the pseudo header, because everything
- else gets 'cancelled out' by the checksum field. This is because the sum was
- complemented before being written to the checksum field.
+encapsulated datagram when the inner checksum is due to be offloaded.
+
+The ones-complement sum of a correctly checksummed TCP or UDP packet is equal
+to the complement of the sum of the pseudo header, because everything else gets
+'cancelled out' by the checksum field. This is because the sum was
+complemented before being written to the checksum field.
+
More generally, this holds in any case where the 'IP-style' ones complement
- checksum is used, and thus any checksum that TX Checksum Offload supports.
+checksum is used, and thus any checksum that TX Checksum Offload supports.
+
That is, if we have set up TX Checksum Offload with a start/offset pair, we
- know that after the device has filled in that checksum, the ones
- complement sum from csum_start to the end of the packet will be equal to
- the complement of whatever value we put in the checksum field beforehand.
- This allows us to compute the outer checksum without looking at the payload:
- we simply stop summing when we get to csum_start, then add the complement of
- the 16-bit word at (csum_start + csum_offset).
+know that after the device has filled in that checksum, the ones complement sum
+from csum_start to the end of the packet will be equal to the complement of
+whatever value we put in the checksum field beforehand. This allows us to
+compute the outer checksum without looking at the payload: we simply stop
+summing when we get to csum_start, then add the complement of the 16-bit word
+at (csum_start + csum_offset).
+
Then, when the true inner checksum is filled in (either by hardware or by
- skb_checksum_help()), the outer checksum will become correct by virtue of
- the arithmetic.
+skb_checksum_help()), the outer checksum will become correct by virtue of the
+arithmetic.
LCO is performed by the stack when constructing an outer UDP header for an
- encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for
- the IPv6 equivalents, in udp6_set_csum().
+encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for the
+IPv6 equivalents, in udp6_set_csum().
+
It is also performed when constructing an IPv4 GRE header, in
- net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when
- constructing an IPv6 GRE header; the GRE checksum is computed over the
- whole packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be
- possible to use LCO here as IPv6 GRE still uses an IP-style checksum.
+net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when
+constructing an IPv6 GRE header; the GRE checksum is computed over the whole
+packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be possible to use
+LCO here as IPv6 GRE still uses an IP-style checksum.
+
All of the LCO implementations use a helper function lco_csum(), in
- include/linux/skbuff.h.
+include/linux/skbuff.h.
LCO can safely be used for nested encapsulations; in this case, the outer
- encapsulation layer will sum over both its own header and the 'middle'
- header. This does mean that the 'middle' header will get summed multiple
- times, but there doesn't seem to be a way to avoid that without incurring
- bigger costs (e.g. in SKB bloat).
+encapsulation layer will sum over both its own header and the 'middle' header.
+This does mean that the 'middle' header will get summed multiple times, but
+there doesn't seem to be a way to avoid that without incurring bigger costs
+(e.g. in SKB bloat).
RCO: Remote Checksum Offload
============================
-RCO is a technique for eliding the inner checksum of an encapsulated
- datagram, allowing the outer checksum to be offloaded. It does, however,
- involve a change to the encapsulation protocols, which the receiver must
- also support. For this reason, it is disabled by default.
+RCO is a technique for eliding the inner checksum of an encapsulated datagram,
+allowing the outer checksum to be offloaded. It does, however, involve a
+change to the encapsulation protocols, which the receiver must also support.
+For this reason, it is disabled by default.
+
RCO is detailed in the following Internet-Drafts:
-https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
-https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
-In Linux, RCO is implemented individually in each encapsulation protocol,
- and most tunnel types have flags controlling its use. For instance, VXLAN
- has the flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that
- RCO should be used when transmitting to a given remote destination.
+
+* https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
+* https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
+
+In Linux, RCO is implemented individually in each encapsulation protocol, and
+most tunnel types have flags controlling its use. For instance, VXLAN has the
+flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that RCO should be
+used when transmitting to a given remote destination.
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================================
Segmentation Offloads in the Linux Networking Stack
+===================================================
+
Introduction
============
* Partial Generic Segmentation Offload - GSO_PARTIAL
* SCTP accelleration with GSO - GSO_BY_FRAGS
+
TCP Segmentation Offload
========================
and we will either increment the IP ID for all frames, or leave it at a
static value based on driver preference.
+
UDP Fragmentation Offload
=========================
still receive them from tuntap and similar devices. Offload of UDP-based
tunnel protocols is still supported.
+
IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads
========================================================
data is normally referred to as the inner headers. Below is the list of
calls to access the given headers:
-IPIP/SIT Tunnel:
- Outer Inner
-MAC skb_mac_header
-Network skb_network_header skb_inner_network_header
-Transport skb_transport_header
+IPIP/SIT Tunnel::
+
+ Outer Inner
+ MAC skb_mac_header
+ Network skb_network_header skb_inner_network_header
+ Transport skb_transport_header
-UDP/GRE Tunnel:
- Outer Inner
-MAC skb_mac_header skb_inner_mac_header
-Network skb_network_header skb_inner_network_header
-Transport skb_transport_header skb_inner_transport_header
+UDP/GRE Tunnel::
+
+ Outer Inner
+ MAC skb_mac_header skb_inner_mac_header
+ Network skb_network_header skb_inner_network_header
+ Transport skb_transport_header skb_inner_transport_header
In addition to the above tunnel types there are also SKB_GSO_GRE_CSUM and
SKB_GSO_UDP_TUNNEL_CSUM. These two additional tunnel types reflect the
headers will be left with a partial checksum and only the outer header
checksum will be computed.
+
Generic Segmentation Offload
============================
offload is required in GSO. Otherwise it becomes possible for a frame to
be re-routed between devices and end up being unable to be transmitted.
+
Generic Receive Offload
=======================
If the value of the IPv4 ID is not sequentially incrementing it will be
altered so that it is when a frame assembled via GRO is segmented via GSO.
+
Partial Generic Segmentation Offload
====================================
that the IPv4 ID field is incremented in the case that a given header does
not have the DF bit set.
+
SCTP accelleration with GSO
===========================
There are some helpers to make this easier:
- - skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if
- an skb is an SCTP GSO skb.
+- skb_is_gso(skb) && skb_is_gso_sctp(skb) is the best way to see if
+ an skb is an SCTP GSO skb.
- - For size checks, the skb_gso_validate_*_len family of helpers correctly
- considers GSO_BY_FRAGS.
+- For size checks, the skb_gso_validate_*_len family of helpers correctly
+ considers GSO_BY_FRAGS.
- - For manipulating packets, skb_increase_gso_size and skb_decrease_gso_size
- will check for GSO_BY_FRAGS and WARN if asked to manipulate these skbs.
+- For manipulating packets, skb_increase_gso_size and skb_decrease_gso_size
+ will check for GSO_BY_FRAGS and WARN if asked to manipulate these skbs.
This also affects drivers with the NETIF_F_FRAGLIST & NETIF_F_GSO_SCTP bits
set. Note also that NETIF_F_GSO_SCTP is included in NETIF_F_GSO_SOFTWARE.