==============================================================================
irdma - Linux* RDMA Driver for the E810 and X722 Intel(R) Ethernet Controllers
==============================================================================

--------
Contents
--------
- Overview
- Prerequisites
- Supported OS List
- Building and Installation
- Confirm RDMA Functionality
- iWARP/RoCEv2 Selection
- iWARP Port Mapper (iwpmd)
- Flow Control Settings
- ECN Configuration
- Devlink Configuration
- Memory Requirements
- Resource Profile Limits
- Resource Limits Selector
- RDMA Statistics
- perftest
- MPI
- Performance
- Interoperability
- Dynamic Tracing
- Dynamic Debug
- Capturing RDMA Traffic with tcpdump
- Known Issues/Notes

--------
Overview
--------
The irdma Linux* driver enables RDMA functionality on RDMA-capable Intel
network devices. Devices supported by this driver:
- Intel(R) Ethernet Controller E810
- Intel(R) Ethernet Network Adapter X722

The E810 and X722 devices each support a different set of RDMA features.
- E810 supports both iWARP and RoCEv2 RDMA transports, and also supports
  congestion management features like priority flow control (PFC) and
  explicit congestion notification (ECN).
- X722 supports only iWARP and a more limited set of configuration
  parameters.

Differences between adapters are described in each section of this document.

For both E810 and X722, the corresponding LAN driver (ice or i40e) must be
built from source included in this release and installed on your system
prior to installing irdma.

-------------
Prerequisites
-------------
- Compile and install the E810 or X722 LAN PF driver from source included
  in this release. Refer to the ice or i40e driver README for installation
  instructions.
  * For E810 adapters, use the ice driver.
  * For X722 adapters, use the i40e driver.
- For best results, use a fully supported OS from the Supported OS List
  below.
- For server memory requirements, see the "Memory Requirements" section of
  this document.
- Install required packages. Refer to the "Building" section of the
  rdma-core README for required packages for your OS:
      https://github.com/linux-rdma/rdma-core/blob/v27.0/README.md
  * RHEL 7 and SLES:
    Install all required packages listed in the rdma-core README.
  * RHEL 8:
    Install the required packages for RHEL 7, then install the following
    additional packages:
        dnf install python3-docutils perl-generators
  * Ubuntu:
    Install the required packages listed in the rdma-core README, then
    install the following additional package:
        apt-get install libsystemd-dev

-----------------
Supported OS List
-----------------
Supported:
* RHEL 8.2
* RHEL 8.1
* RHEL 7.8
* SLES 15 SP1
* SLES 12 SP5

Supported Not Validated:
* RHEL 8
* RHEL 7.7
* RHEL 7.6 + OFED 4.17-1
* RHEL 7.5 + OFED 4.17-1
* RHEL 7.4 + OFED 4.17-1
* SLES 15
* SLES 15 + OFED 4.17-1
* SLES 12 SP 4
* SLES 12 SP 4 + OFED 4.17-1
* SLES 12 SP 3 + OFED 4.17-1
* Ubuntu 18.04
* Ubuntu 20.04
* Linux kernel stable 5.8.*
* Linux kernel LTS 5.4.*, 4.19.*, 4.14.*

-------------------------
Building and Installation
-------------------------
To build and install the irdma driver and supporting rdma-core libraries:

1. Decompress the irdma driver archive:
       tar zxf irdma-<version>.tgz

2. Build and install the RDMA driver:
       cd irdma-<version>
       ./build.sh

   By default, the irdma driver is built using in-distro RDMA libraries
   and modules. Optionally, irdma may also be built using OFED modules.
   See the Supported OS List above for a list of OSes that support this
   option.
   * Note: Intel products are not validated on other vendors' proprietary
     software packages.

   To install irdma using OFED modules:
   - Download OFED-4.17-1.tgz from the OpenFabrics Alliance:
         wget http://openfabrics.org/downloads/OFED/ofed-4.17-1/OFED-4.17-1.tgz
   - Decompress the archive:
         tar xzvf OFED-4.17-1.tgz
   - Install OFED:
         cd OFED-4.17-1
         ./install --all
   - Reboot after installation is complete.
   - Build the irdma driver with the "ofed" option:
         cd /path/to/irdma-<version>
         ./build.sh ofed
   - Continue with the installation steps below.

3. Load the driver:
       RHEL and Ubuntu:
           modprobe irdma
       SLES:
           modprobe irdma --allow-unsupported

   Notes:
   - This modprobe step is required only during installation. Normally,
     irdma is autoloaded via a udev rule when ice or i40e is loaded:
         /usr/lib/udev/rules.d/90-rdma-hw-modules.rules
   - For SLES, to automatically allow loading unsupported modules, add the
     following to /etc/modprobe.d/10-unsupported-modules.conf:
         allow_unsupported_modules 1

4. Uninstall any previous versions of rdma-core user-space libraries. For
   example, in RHEL:
       yum erase rdma-core

   Note: "yum erase rdma-core" will also remove any packages that depend
   on rdma-core, such as perftest or fio. Please re-install them after
   installing rdma-core.

5. Patch, build, and install rdma-core user space libraries:

   RHEL:
       # Download rdma-core-27.0.tar.gz from GitHub
       wget https://github.com/linux-rdma/rdma-core/releases/download/v27.0/rdma-core-27.0.tar.gz
       # Apply patch libirdma-27.0.patch to rdma-core
       tar -xzvf rdma-core-27.0.tar.gz
       cd rdma-core-27.0
       patch -p2 < /path/to/irdma-<version>/libirdma-27.0.patch
       # Make sure directory rdma-core-27.0/redhat and its contents are
       # under group 'root'
       cd ..
       chgrp -R root rdma-core-27.0/redhat
       # Repackage with the proper name for building (note the "tgz"
       # extension instead of "tar.gz")
       tar -zcvf rdma-core-27.0.tgz rdma-core-27.0
       # Build rdma-core
       mkdir -p ~/rpmbuild/SOURCES
       mkdir -p ~/rpmbuild/SPECS
       cp rdma-core-27.0.tgz ~/rpmbuild/SOURCES/
       cd ~/rpmbuild/SOURCES
       tar -xzvf rdma-core-27.0.tgz
       cp ~/rpmbuild/SOURCES/rdma-core-27.0/redhat/rdma-core.spec ~/rpmbuild/SPECS/
       cd ~/rpmbuild/SPECS/
       rpmbuild -ba rdma-core.spec
       # Install RPMs
       cd ~/rpmbuild/RPMS/x86_64
       yum install *27.0*.rpm

   SLES:
       # Download rdma-core-27.0.tar.gz from GitHub
       wget https://github.com/linux-rdma/rdma-core/releases/download/v27.0/rdma-core-27.0.tar.gz
       # Apply patch libirdma-27.0.patch to rdma-core
       tar -xzvf rdma-core-27.0.tar.gz
       cd rdma-core-27.0
       patch -p2 < /path/to/irdma-<version>/libirdma-27.0.patch
       cd ..
       # Repackage the rdma-core directory into a tar.gz archive
       tar -zcvf rdma-core-27.0.tar.gz rdma-core-27.0
       # Create an empty placeholder baselibs.conf file
       touch /usr/src/packages/SOURCES/baselibs.conf
       # Build rdma-core
       cp rdma-core-27.0.tar.gz /usr/src/packages/SOURCES
       cp rdma-core-27.0/suse/rdma-core.spec /usr/src/packages/SPECS/
       cd /usr/src/packages/SPECS/
       rpmbuild -ba rdma-core.spec --without=curlmini
       cd /usr/src/packages/RPMS/x86_64
       rpm -ivh --force *27.0*.rpm

   Ubuntu:
       To create Debian packages from rdma-core:
       # Download rdma-core-27.0.tar.gz from GitHub
       wget https://github.com/linux-rdma/rdma-core/releases/download/v27.0/rdma-core-27.0.tar.gz
       # Apply patch libirdma-27.0.patch to rdma-core
       tar -xzvf rdma-core-27.0.tar.gz
       cd rdma-core-27.0
       patch -p2 < /path/to/irdma-<version>/libirdma-27.0.patch
       # Build rdma-core
       dh clean --with python3,systemd --builddirectory=build-deb
       dh build --with systemd --builddirectory=build-deb
       sudo dh binary --with python3,systemd --builddirectory=build-deb
       # This creates .deb packages in the parent directory
       # To install the .deb packages:
       sudo dpkg -i ../*.deb

6. Reboot the server after installing the irdma driver and rdma-core
   packages.
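After reboot, the install can be sanity-checked from a script by listing the
RDMA devices. The helper below is a sketch (the function name is ours, not
part of the driver or rdma-core): it parses "ibv_devices"-style output on
stdin and prints just the device names, skipping the two header lines.

```shell
# Sketch of a post-install check: extract RDMA device names from the
# output of "ibv_devices". The function name is illustrative. Input is
# ibv_devices-style text on stdin; the first two lines (column headers
# and the dashed separator) are skipped.
list_rdma_devices() {
    awk 'NR > 2 && NF >= 2 { print $1 }'
}

# Example: on a live system, pipe the real tool instead:
#     ibv_devices | list_rdma_devices
```

An empty result after installation usually means the LAN driver (ice or
i40e) or irdma itself failed to load; check dmesg before proceeding.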
--------------------------
Confirm RDMA Functionality
--------------------------
After successful installation, RDMA devices are listed in the output of
"ibv_devices". For example:

    # ibv_devices
        device                 node GUID
        ------              ----------------
        rdmap175s0f0        40a6b70b6f300000
        rdmap175s0f1        40a6b70b6f310000

Notes:
- Device names may differ depending on OS or kernel.
- Node GUID is different for the same device in iWARP vs. RoCEv2 mode.

Each RDMA device is associated with a network interface. The sysfs
filesystem can help show how these devices are related. For example:
- To show RDMA devices associated with the "ens801f0" network interface:
      # ls /sys/class/net/ens801f0/device/infiniband/
      rdmap175s0f0
- To show the network interface associated with the "rdmap175s0f0" RDMA
  device:
      # ls /sys/class/infiniband/rdmap175s0f0/device/net/
      ens801f0

Before running RDMA applications, ensure that all hosts have IP addresses
assigned to the network interface associated with the RDMA device. The
RDMA device uses the IP configuration from the corresponding network
interface. There is no additional configuration required for the RDMA
device.

To confirm RDMA functionality, run rping:

1) Start the rping server:
       rping -sdvVa [server IP address]
2) Start the rping client:
       rping -cdvVa [server IP address] -C 10
3) rping will run for 10 iterations (-C 10) and print data payloads on the
   console.

Notes:
- Confirm rping functionality both from each host to itself and between
  hosts. For example:
  * Run rping server and client both on Host A.
  * Run rping server and client both on Host B.
  * Run rping server on Host A and rping client on Host B.
  * Run rping server on Host B and rping client on Host A.
- When connecting multiple rping clients to a persistent rping server,
  older kernels may experience a crash related to the handling of cm_id
  values in the kernel stack. With E810, this problem typically appears in
  the system log as a kernel oops and stack trace pointing to
  irdma_accept.
  The issue has been fixed in kernels 5.4.61 and later. For patch details,
  see:
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/infiniband/core/ucma.c?h=v5.9-rc2&id=7c11910783a1ea17e88777552ef146cace607b3c

----------------------
iWARP/RoCEv2 Selection
----------------------
X722:
    The X722 adapter supports only the iWARP transport.

E810:
    The E810 controller supports both iWARP and RoCEv2 transports. By
    default, the irdma driver is loaded in iWARP mode. RoCEv2 may be
    selected either globally (for all ports) using the module parameter
    "roce_ena=1" or for individual ports using the devlink interface.

--- Global Selection

To automatically enable RoCEv2 mode for all ports when the irdma driver is
loaded, add the following line to /etc/modprobe.d/irdma.conf:
    options irdma roce_ena=1

The irdma driver may also be manually loaded with the "roce_ena=1"
parameter on the modprobe command line. To manually load all irdma ports
in RoCEv2 mode:
- If the irdma driver is currently loaded, first unload it:
      rmmod irdma
- Reload the driver in RoCEv2 mode:
      modprobe irdma roce_ena=1

--- Per-Port Selection

E810 interfaces may be configured per interface for iWARP mode (default)
or RoCEv2 via devlink parameter configuration. See the "Devlink
Configuration" section below for instructions on per-port iWARP/RoCEv2
selection.

-------------------------
iWARP Port Mapper (iwpmd)
-------------------------
The iWARP port mapper service (iwpmd) coordinates with the host network
stack and manages TCP port space for iWARP applications.

iwpmd is automatically loaded when ice or i40e is loaded via udev rules in
/usr/lib/udev/rules.d/90-iwpmd.rules.

To verify iWARP port mapper status:
    systemctl status iwpmd

---------------------
Flow Control Settings
---------------------
X722:
    The X722 adapter supports only link-level flow control (LFC).

E810:
    The E810 controller supports both link-level flow control (LFC) and
    priority flow control (PFC).
    Enabling flow control is strongly recommended when using E810 in
    RoCEv2 mode.

--- Link-Level Flow Control (LFC) (E810 and X722)

To enable link-level flow control on E810 or X722, use "ethtool -A". For
example, to enable LFC in both directions (rx and tx):
    ethtool -A DEVNAME rx on tx on

Confirm the setting with "ethtool -a":
    ethtool -a DEVNAME

Sample output:
    Pause parameters for DEVNAME:
    Autonegotiate:  on
    RX:             on
    TX:             on
    RX negotiated:  on
    TX negotiated:  on

Full enablement of LFC requires the switch or link partner be configured
for rx and tx pause frames. Refer to switch vendor documentation for more
details.

--- Priority Flow Control (PFC) (E810 only)

Priority flow control (PFC) is supported on E810 in both willing and
non-willing modes. E810 also has two Data Center Bridging (DCB) modes:
software and firmware. For more background on software and firmware modes,
refer to the E810 ice driver README.

- For PFC willing mode, firmware DCB is recommended.
- For PFC non-willing mode, software DCB must be used.

Note: E810 supports a maximum of 4 traffic classes (TCs), one of which may
have PFC enabled.

*** PFC willing mode

In willing mode, E810 is "willing" to accept DCB settings from its link
partner. DCB is configured on the link partner (typically a switch), and
E810 will automatically discover and apply those DCB settings to its own
port. This simplifies DCB configuration in a larger cluster and eliminates
the need to independently configure DCB on both sides of the link.

To enable PFC in willing mode on E810, use ethtool to enable firmware DCB.
Enabling firmware DCB automatically places the NIC in willing mode:
    ethtool --set-priv-flags DEVNAME fw-lldp-agent on

To confirm the setting, use the following command:
    ethtool --show-priv-flags DEVNAME

Expected output:
    fw-lldp-agent : on

Note: When firmware DCB is enabled, the E810 NIC may experience an
adapter-wide reset when the DCBX willing configuration change propagated
from the link partner removes an RDMA-enabled traffic class (TC). This
typically occurs when removing a TC associated with priority 0 (the
default priority for RDMA). The reset results in a temporary loss of
connectivity as the adapter re-initializes.

Switch DCB and PFC configuration syntax varies by vendor. Consult your
switch manual for details. Sample Arista switch configuration commands:

- Example: Enable PFC for priority 0 on switch port 21
  * Enter configuration mode for switch port 21:
        switch#configure
        switch(config)#interface ethernet 21/1
  * Turn PFC on:
        switch(config-if-Et21/1)#priority-flow-control mode on
  * Set priority 0 to "no-drop" (i.e., PFC enabled):
        switch(config-if-Et21/1)#priority-flow-control priority 0 no-drop
  * Verify switch port PFC configuration:
        switch(config-if-Et21/1)#show priority-flow-control

- Example: Enable DCBX on switch port 21
  * Enable DCBX in IEEE mode:
        switch(config-if-Et21/1)#dcbx mode ieee
  * Show DCBX settings (including neighbor port settings):
        switch(config-if-Et21/1)#show dcbx

*** PFC non-willing mode

In non-willing mode, DCB settings must be configured on both E810 and its
link partner. Non-willing mode is software-based. OpenLLDP (lldpad and
lldptool) is recommended.

To enable non-willing PFC on E810:

1. Disable firmware DCB. Firmware DCB is always willing. If enabled, it
   will override any software settings.
       ethtool --set-priv-flags DEVNAME fw-lldp-agent off

2. Install OpenLLDP:
       yum install lldpad

3. Start the Open LLDP daemon:
       lldpad -d

4. Verify functionality by showing current DCB settings on the NIC:
       lldptool -ti DEVNAME

5.
   Configure your desired DCB settings, including traffic classes,
   bandwidth allocations, and PFC. The following example enables PFC on
   priority 0, maps all priorities to traffic class (TC) 0, and allocates
   all bandwidth to TC0. This simple configuration is suitable for
   enabling PFC for all traffic, which may be useful for back-to-back
   benchmarking. Datacenters will typically use a more complex
   configuration to ensure quality-of-service (QoS).

   a. Enable PFC for priority 0:
          lldptool -Ti DEVNAME -V PFC willing=no enabled=0
   b. Map all priorities to TC0 and allocate all bandwidth to TC0:
          lldptool -Ti DEVNAME -V ETS-CFG willing=no \
              up2tc=0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:0 \
              tsa=0:ets,1:strict,2:strict,3:strict,4:strict,5:strict,6:strict,7:strict \
              tcbw=100,0,0,0,0,0,0,0

6. Verify the output of "lldptool -ti DEVNAME":
       Chassis ID TLV
           MAC: 68:05:ca:a3:89:78
       Port ID TLV
           MAC: 68:05:ca:a3:89:78
       Time to Live TLV
           120
       IEEE 8021QAZ ETS Configuration TLV
           Willing: no
           CBS: not supported
           MAX_TCS: 8
           PRIO_MAP: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
           TC Bandwidth: 100% 0% 0% 0% 0% 0% 0% 0%
           TSA_MAP: 0:ets 1:strict 2:strict 3:strict 4:strict 5:strict 6:strict 7:strict
       IEEE 8021QAZ PFC TLV
           Willing: no
           MACsec Bypass Capable: no
           PFC capable traffic classes: 8
           PFC enabled: 0
       End of LLDPDU TLV

7. Configure the same settings on the link partner.

Full enablement of PFC requires the switch or link partner be configured
for PFC pause frames. Refer to switch vendor documentation for more
details.

--- Directing RDMA traffic to a traffic class

When using PFC, traffic may be directed to one or more traffic classes
(TCs). Because RDMA traffic bypasses the kernel, Linux traffic control
methods like tc, cgroups, or egress-qos-map can't be used. Instead, set
the Type of Service (ToS) field on your application command line.

ToS-to-priority mappings are hardcoded in Linux as follows:
    ToS   Priority
    ---   --------
     0       0
     8       2
    24       4
    16       6

Priorities are then mapped to traffic classes through ETS, using lldptool
or switch utilities.
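When scripting benchmark runs, the fixed table above can be encoded in a
small helper so the right ToS value is picked for a target priority. The
function below is a sketch (its name is ours); it simply mirrors the
hardcoded mapping in the table:

```shell
# Sketch: encode the hardcoded Linux ToS-to-priority table above.
# The function name is illustrative; it echoes the priority that a
# given ToS value maps to, per the table in this section.
tos_to_prio() {
    case "$1" in
        0)  echo 0 ;;
        8)  echo 2 ;;
        24) echo 4 ;;
        16) echo 6 ;;
        *)  echo "no priority mapping for ToS $1" >&2; return 1 ;;
    esac
}
```

For example, "tos_to_prio 16" prints 6, so applications launched with
"-t 16" land on priority 6.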
Examples of setting ToS 16 in an application:
    ucmatose -t 16
    ib_write_bw -t 16

Alternatively, for RoCEv2, ToS may be set for all RoCEv2 traffic using
configfs. For example, to set ToS 16 on device <ibdev>, port 1:
    mkdir /sys/kernel/config/rdma_cm/<ibdev>
    echo 16 > /sys/kernel/config/rdma_cm/<ibdev>/ports/1/default_roce_tos

-----------------
ECN Configuration
-----------------
X722:
    Congestion control settings are not supported on X722 adapters.

E810:
    The E810 controller supports the following congestion control
    algorithms:
    - iWARP DCTCP
    - iWARP TCP New Reno plus ECN
    - iWARP TIMELY
    - RoCEv2 DCQCN
    - RoCEv2 DCTCP
    - RoCEv2 TIMELY

Congestion control settings are accessed through configfs. Additional
DCQCN tunings are available through the devlink interface. See the
"Devlink Configuration" section for details.

--- Configuration in configfs

To access congestion control settings:

1. After driver load, change to the irdma configfs directory:
       cd /sys/kernel/config/irdma

2. Create a new directory for each RDMA device you want to configure.
   Note: Use "ibv_devices" for a list of RDMA devices. For example, to
   create configfs entries for the rdmap175s0f0 device:
       mkdir rdmap175s0f0

3. List the new directory to get its dynamic congestion control knobs and
   values:
       cd rdmap175s0f0
       for f in *; do echo -n "$f: "; cat "$f"; done;

   If the interface is in iWARP mode, the files have an "iw_" prefix:
   - iw_dctcp_enable
   - iw_ecn_enable
   - iw_timely_enable

   If the interface is in RoCEv2 mode, the files have a "roce_" prefix:
   - roce_dcqcn_enable
   - roce_dctcp_enable
   - roce_timely_enable

4. Enable or disable the desired algorithms.
   To enable an algorithm:
       echo 1 > <algorithm_file>

   For example, to add ECN marker processing to the default TCP New Reno
   iWARP congestion control algorithm:
       echo 1 > /sys/kernel/config/irdma/rdmap175s0f0/iw_ecn_enable

   To disable an algorithm:
       echo 0 > <algorithm_file>

   For example:
       echo 0 > /sys/kernel/config/irdma/rdmap175s0f0/iw_ecn_enable

   To read the current status:
       cat <algorithm_file>

   Default values:
       iw_dctcp_enable:    off
       iw_timely_enable:   off
       iw_ecn_enable:      ON
       roce_timely_enable: off
       roce_dctcp_enable:  off
       roce_dcqcn_enable:  off

5. Remove the configfs directory created above. Without removing these
   directories, the driver will not unload.
       rmdir /sys/kernel/config/irdma/rdmap175s0f0

---------------------
Devlink Configuration
---------------------
X722:
    Devlink parameter configuration is not supported on X722 adapters.

E810:
    The E810 controller supports devlink configuration for the following
    controls:
    - iWARP/RoCEv2 per-port selection
    - DCQCN congestion control tunings
    - Fragment count limit

--- Devlink OS support

Devlink dev parameter configuration is a recent Linux capability that
requires both iproute2 tool support and kernel support. The following
OS/kernel versions support devlink dev parameters:
- RHEL 8 or later
- SLES 15 SP1 or later
- Ubuntu 18.04 or later
- Linux kernel 4.19 or later

iproute2 may need to be updated to add parameter capability to the devlink
configuration. The latest released version can be downloaded and installed
from:
    https://github.com/shemminger/iproute2/releases

--- Devlink parameter configuration (E810 only)

1. Get the PCIe bus-info of the desired interface using "ethtool -i":
       ethtool -i DEVNAME

   Example:
       # ethtool -i enp175s0f0
       driver: ice
       version: 0.11.7
       firmware-version: 0.50 0x800019de 1.2233.0
       expansion-rom-version:
       bus-info: 0000:af:00.0
       supports-statistics: yes
       supports-test: yes
       supports-eeprom-access: yes
       supports-register-dump: yes
       supports-priv-flags: yes

   bus-info is 0000:af:00.0

2.
   Find the devlink device name 'ice_rdma.x' in the /sys/devices folder:
       ls /sys/devices/*/*/<bus-info>/ | grep ice_rdma

   Example:
       # ls /sys/devices/*/*/0000:af:00.0/ | grep ice_rdma
       ice_rdma.16

3. To display available parameters:
       devlink dev param show

   RDMA devlink parameters for E810:

   roce_enable
       Selects the RDMA transport: RoCEv2 (true) or iWARP (false).
   resource_limits_selector
       Limits available queue pairs (QPs). See the "Resource Limits
       Selector" section for details and values.
   dcqcn_enable
       Enables the DCQCN algorithm for RoCEv2.
       Note: "roce_enable" must also be set to "true".
   dcqcn_cc_cfg_valid
       Indicates that all DCQCN parameters are valid and should be updated
       in registers or QP context.
   dcqcn_min_dec_factor
       The minimum factor by which the current transmit rate can be
       changed when processing a CNP. Value is given as a percentage
       (1-100).
   dcqcn_min_rate
       The minimum rate limit value, in Mbits per second.
   dcqcn_F
       The number of times to stay in each stage of bandwidth recovery.
   dcqcn_T
       The number of microseconds that should elapse before increasing the
       CWND in DCQCN mode.
   dcqcn_B
       The number of bytes to transmit before updating CWND in DCQCN mode.
   dcqcn_rai_factor
       The number of MSS to add to the congestion window in additive
       increase mode.
   dcqcn_hai_factor
       The number of MSS to add to the congestion window in hyperactive
       increase mode.
   dcqcn_rreduce_mperiod
       The minimum time between two consecutive rate reductions for a
       single flow. Rate reduction will occur only if a CNP is received
       during the relevant time interval.
   fragment_count_limit
       Sets the fragment count limit to adjust maximum values for queue
       depth and inline data size.

4.
   To set a parameter:
       devlink dev param set platform/<ice_rdma.x> name <parameter> value <value> cmode driverinit

   Example: Enable RoCEv2, enable DCQCN, and set min_dec_factor=5 on
   ice_rdma.17:
       devlink dev param set platform/ice_rdma.17 name roce_enable value true cmode driverinit
       devlink dev param set platform/ice_rdma.17 name dcqcn_enable value true cmode driverinit
       devlink dev param set platform/ice_rdma.17 name dcqcn_min_dec_factor value 5 cmode driverinit

5. Reload the device port with the new mode:
       devlink dev reload platform/<ice_rdma.x>

   Example:
       devlink dev reload platform/ice_rdma.16

   Note: This does not reload the driver, so other ports are unaffected.

-------------------
Memory Requirements
-------------------
Default irdma initialization requires a minimum of ~210 MB (for E810) or
~160 MB (for X722) of memory per port.

For servers where the amount of memory is constrained, you can decrease
the required memory by lowering the resources available to E810 or X722 by
loading the driver with the following resource profile setting:
    modprobe irdma rsrc_profile=2

To automatically apply the setting when the driver is loaded, add the
following to /etc/modprobe.d/irdma.conf:
    options irdma rsrc_profile=2

Note: This can have performance and scaling impacts, as the number of
queue pairs and other RDMA resources is decreased in order to lower memory
usage to approximately 55 MB (for E810) or 51 MB (for X722) per port.
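As a rough planning aid, the per-port figures above can be folded into a
small helper when sizing constrained servers. The numbers below are taken
directly from this section; the function name and structure are ours:

```shell
# Sketch: approximate per-port RDMA memory requirement in MB, using the
# figures quoted in this section (default profile vs. rsrc_profile=2).
# The helper name is illustrative; values are approximations only.
irdma_mem_per_port_mb() {
    device="$1"    # "e810" or "x722"
    profile="$2"   # "default" or "2"
    case "$device/$profile" in
        e810/default) echo 210 ;;
        x722/default) echo 160 ;;
        e810/2)       echo 55 ;;
        x722/2)       echo 51 ;;
        *)            echo "unknown device/profile: $device/$profile" >&2
                      return 1 ;;
    esac
}

# Example: total for a 2-port E810 at the default profile:
#     echo $(( $(irdma_mem_per_port_mb e810 default) * 2 ))
```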
-----------------------
Resource Profile Limits
-----------------------
In the default resource profile, the RDMA resources configured for each
adapter are as follows:

    E810 (2 ports):
        Queue Pairs:       4092
        Completion Queues: 8189
        Memory Regions:    4194302
    X722 (4 ports):
        Queue Pairs:       1020
        Completion Queues: 2045
        Memory Regions:    2097150

For resource profile 2, the configuration is:

    E810 (2 ports):
        Queue Pairs:       508
        Completion Queues: 1021
        Memory Regions:    524286
    X722 (4 ports):
        Queue Pairs:       252
        Completion Queues: 509
        Memory Regions:    524286

------------------------
Resource Limits Selector
------------------------
In addition to the resource profile, you can further limit resources via
the "limits_sel" module parameter:

    E810: modprobe irdma limits_sel=<0-6>
    X722: modprobe irdma gen1_limits_sel=<0-5>

To automatically apply this setting when the driver is loaded, add the
following to /etc/modprobe.d/irdma.conf:
    options irdma limits_sel=<value>

The values below apply to a 2-port E810 NIC:
    0 - Default, up to 4092 QPs
    1 - Minimum, up to 124 QPs
    2 - Up to 1020 QPs
    3 - Up to 2044 QPs
    4 - Up to 16380 QPs
    5 - Up to 65532 QPs
    6 - Maximum, up to 131068 QPs

For X722, the resource limit selector defaults to a value of 2. A single
port supports a maximum of 64k QPs, and a 4-port X722 supports up to 16k
QPs per port.

---------------
RDMA Statistics
---------------
RDMA protocol statistics for E810 or X722 are found in sysfs. To display
all counters and values:
    cd /sys/class/infiniband/<ibdev>/hw_counters
    for f in *; do echo -n "$f: "; cat "$f"; done;

The following counters will increment when RDMA applications are
transferring data over the network in iWARP mode:
- tcpInSegs
- tcpOutSegs

Available counters:
    ip4InDiscards      IPv4 packets received and discarded.
    ip4InReasmRqd      IPv4 fragments received by Protocol Engine.
    ip4InMcastOctets   IPv4 multicast octets received.
    ip4InMcastPkts     IPv4 multicast packets received.
    ip4InOctets        IPv4 octets received.
    ip4InPkts          IPv4 packets received.
    ip4InTruncatedPkts IPv4 packets received and truncated due to
                       insufficient buffering space in UDA RQ.
    ip4OutSegRqd       IPv4 fragments supplied by Protocol Engine to the
                       lower layers for transmission.
    ip4OutMcastOctets  IPv4 multicast octets transmitted.
    ip4OutMcastPkts    IPv4 multicast packets transmitted.
    ip4OutNoRoutes     IPv4 datagrams discarded due to routing problem (no
                       hit in ARP table).
    ip4OutOctets       IPv4 octets supplied by the PE to the lower layers
                       for transmission.
    ip4OutPkts         IPv4 packets supplied by the PE to the lower layers
                       for transmission.
    ip6InDiscards      IPv6 packets received and discarded.
    ip6InReasmRqd      IPv6 fragments received by Protocol Engine.
    ip6InMcastOctets   IPv6 multicast octets received.
    ip6InMcastPkts     IPv6 multicast packets received.
    ip6InOctets        IPv6 octets received.
    ip6InPkts          IPv6 packets received.
    ip6InTruncatedPkts IPv6 packets received and truncated due to
                       insufficient buffering space in UDA RQ.
    ip6OutSegRqd       IPv6 fragments supplied by Protocol Engine to the
                       lower layers for transmission.
    ip6OutMcastOctets  IPv6 multicast octets transmitted.
    ip6OutMcastPkts    IPv6 multicast packets transmitted.
    ip6OutNoRoutes     IPv6 datagrams discarded due to routing problem (no
                       hit in ARP table).
    ip6OutOctets       IPv6 octets supplied by the PE to the lower layers
                       for transmission.
    ip6OutPkts         IPv6 packets supplied by the PE to the lower layers
                       for transmission.
    iwInRdmaReads      RDMAP total RDMA read request messages received.
    iwInRdmaSends      RDMAP total RDMA send-type messages received.
    iwInRdmaWrites     RDMAP total RDMA write messages received.
    iwOutRdmaReads     RDMAP total RDMA read request messages sent.
    iwOutRdmaSends     RDMAP total RDMA send-type messages sent.
    iwOutRdmaWrites    RDMAP total RDMA write messages sent.
    iwRdmaBnd          RDMA verbs total bind operations carried out.
    iwRdmaInv          RDMA verbs total invalidate operations carried out.
    RxECNMrkd          Number of packets received with the ECN bits set to
                       indicate congestion.
    cnpHandled         Number of Congestion Notification Packets that have
                       been handled by the reaction point.
    cnpIgnored         Number of Congestion Notification Packets that have
                       been ignored by the reaction point.
    rxVlanErrors       Ethernet received packets with incorrect VLAN_ID.
    tcpRetransSegs     Total number of TCP segments retransmitted.
    tcpInOptErrors     TCP segments received with unsupported TCP options
                       or TCP option length errors.
    tcpInProtoErrors   TCP segments received that are dropped by TRX due
                       to TCP protocol errors.
    tcpInSegs          TCP segments received.
    tcpOutSegs         TCP segments transmitted.
    cnpSent            Number of Congestion Notification Packets that have
                       been sent by the reaction point.
    RxUDP              UDP segments received without errors.
    TxUDP              UDP segments transmitted without errors.

--------
perftest
--------
The perftest package is a set of RDMA microbenchmarks designed to test
bandwidth and latency using RDMA verbs. The package is maintained upstream
here:
    https://github.com/linux-rdma/perftest

perftest-4.4-0.29 is recommended. Earlier versions of perftest had known
issues with iWARP that have since been fixed. Versions 4.4-0.4 through
4.4-0.18 are therefore NOT recommended.

To run a basic ib_write_bw test:

1. Start the server:
       ib_write_bw -R
2. Start the client:
       ib_write_bw -R <server IP address>
3. The benchmark will run to completion and print performance data on both
   the client and server consoles.

Notes:
- The "-R" option is required for iWARP and optional for RoCEv2.
- Use "-d <ibdev>" on the perftest command lines to use a specific RDMA
  device.
- For ib_read_bw, use "-o 1" for testing with 3rd-party link partners.
- For ib_send_lat and ib_write_lat, use "-I 96" to limit inline data size
  to the supported value.
- iWARP supports only RC connections. RoCEv2 supports RC and UD.
  Connection types XRC, UC, and DC are not supported.
- Atomic operations are not supported on E810 or X722.

---
MPI
---
--- Intel MPI

Intel MPI uses the OpenFabrics Interfaces (OFI) framework and libfabric
user space libraries to communicate with network hardware.
* Recommended Intel MPI versions:
      Single-rail: Intel MPI 2019u8
      Multi-rail:  Intel MPI 2019u3
  Note: Intel MPI 2019u4 is not recommended due to known incompatibilities
  with iWARP.

* Recommended libfabric version: libfabric-1.11.0

  The Intel MPI package includes a version of libfabric. This "internal"
  version is automatically installed along with Intel MPI and used by
  default. To use a different ("external") version of libfabric with Intel
  MPI:

  1. Download libfabric from https://github.com/ofiwg/libfabric.
  2. Build and install it according to the libfabric documentation.
  3. Configure Intel MPI to use a non-internal version of libfabric:
         export I_MPI_OFI_LIBRARY_INTERNAL=0
     or
         source <MPI install path>/intel64/bin/mpivars.sh -ofi_internal=0
  4. Verify your libfabric version by using the I_MPI_DEBUG environment
     variable on the mpirun command line:
         -genv I_MPI_DEBUG=1
     The libfabric version will appear in the mpirun output.

* Sample command line for a 2-process pingpong test:
      mpirun -l -n 2 -ppn 1 -host myhost1,myhost2 -genv I_MPI_DEBUG=5 \
          -genv FI_VERBS_MR_CACHE_ENABLE=1 -genv FI_VERBS_IFACE=<interface> \
          -genv FI_OFI_RXM_USE_SRX=0 -genv FI_PROVIDER='verbs;ofi_rxm' \
          /path/to/IMB-MPI1 Pingpong

  Notes:
  - The example is for libfabric 1.8 or greater. For earlier versions, use
    "-genv FI_PROVIDER='verbs'".
  - For Intel MPI 2019u6, use "-genv MPIR_CVAR_CH4_OFI_ENABLE_DATA=0".
  - When using Intel MPI, it's recommended to enable only one interface on
    your networking device to avoid MPI application connectivity issues or
    hangs. This issue affects all Intel MPI transports, including TCP and
    RDMA. To avoid the issue, use "ifdown <interface>" or "ip link set
    down <interface>" to disable all network interfaces on your adapter
    except for the one used for MPI.

--- OpenMPI

* OpenMPI version 4.0.3 is recommended.

-----------
Performance
-----------
RDMA performance may be optimized by adjusting system, application, or
driver settings.

- Flow control is required for best performance in RoCEv2 mode and is
  optional in iWARP mode.
  Both link-level flow control (LFC) and priority flow control (PFC) are
  supported, but PFC is recommended. See the "Flow Control Settings"
  section of this document for configuration details.

- For bandwidth applications, multiple queue pairs (QPs) are required for
  best performance. For example, in the perftest suite, use "-q 8" on the
  command line to run with 8 QPs.

- For best results, configure your application to use CPUs on the same
  NUMA node as your adapter. For example:
  * To list CPUs local to your NIC:
        cat /sys/class/infiniband/<ibdev>/device/local_cpulist
  * To specify CPUs (e.g., 24-47) when running a perftest application:
        taskset -c 24-47 ib_write_bw
  * To specify CPUs when running an Intel MPI application:
        mpirun -genv I_MPI_PIN_PROCESSOR_LIST=24-47 ./my_prog

- For some workloads, latency may be improved by enabling push mode in the
  irdma driver:
  * Create the configfs directory for your RDMA device:
        mkdir /sys/kernel/config/irdma/<ibdev>
  * Enable push mode:
        echo 1 > /sys/kernel/config/irdma/<ibdev>/push_mode
  * Remove the directory:
        rmdir /sys/kernel/config/irdma/<ibdev>

- System and BIOS tunings may also improve performance. Settings vary by
  platform; consult your OS and BIOS documentation for details. In
  general:
  * Disable power-saving features such as P-states and C-states.
  * Set BIOS CPU power policies to "Performance" or similar.
  * Set BIOS CPU workload configuration to "I/O Sensitive" or similar.
  * On RHEL 7.*/8.*, use the "latency-performance" tuning profile:
        tuned-adm profile latency-performance

----------------
Interoperability
----------------
--- Mellanox

E810 and X722 support interop with Mellanox RoCEv2-capable adapters.

In tests like ib_send_bw, use the -R option to select rdma_cm for
connection establishment. Instead of -R, you can also select a gid index
explicitly with the -x option:

    On E810 or X722:  ib_send_bw -F -n 5 -x 0
    On Mellanox:      ib_send_bw -F -n 5 -x <gid index>

...where -x specifies the gid index value for RoCEv2.
Look in the /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types
directory to find the GID type of each index for port 1.

Note: Using RDMA reads with Mellanox may result in poor performance if
there is packet loss.

--- Chelsio

X722 supports interoperation with Chelsio iWARP devices. Load the
Chelsio T4/T5 RDMA driver (iw_cxgb4) with the parameter "dack_mode" set
to 0:
   modprobe iw_cxgb4 dack_mode=0
To automatically apply this setting whenever the iw_cxgb4 driver is
loaded, add the following line to /etc/modprobe.d/iw_cxgb4.conf:
   options iw_cxgb4 dack_mode=0

---------------
Dynamic Tracing
---------------

Dynamic tracing is available for irdma's connection manager. Turn on
tracing with the following command:
   echo 1 > /sys/kernel/debug/tracing/events/irdma_cm/enable
To retrieve the trace:
   cat /sys/kernel/debug/tracing/trace

-------------
Dynamic Debug
-------------

irdma supports Linux dynamic debug. To enable all dynamic debug messages
when the irdma driver loads, use the "dyndbg" module parameter:
   modprobe irdma dyndbg='+p'
Debug messages will then appear in the system log or dmesg.

Enabling dynamic debug can be extremely verbose and is not recommended
for normal operation. For more information on dynamic debug, including
tips on how to refine the debug output, see:
   https://www.kernel.org/doc/html/v4.11/admin-guide/dynamic-debug-howto.html

-----------------------------------
Capturing RDMA Traffic with tcpdump
-----------------------------------

RDMA traffic bypasses the kernel and is not normally visible to the
Linux tcpdump utility. You may capture RDMA traffic with tcpdump by
using port mirroring on a switch.

1. Connect 3 hosts to a switch:
   - 2 compute nodes to run RDMA traffic
   - 1 host to monitor traffic
2. Configure the switch to mirror traffic from one compute node's switch
   port to the monitoring host's switch port. Consult your switch
   documentation for syntax.
3. Unload the irdma driver on the monitoring host:
      # rmmod irdma
   Traffic may not be captured correctly if the irdma driver is loaded.
4. Start tcpdump on the monitoring host.
   For example:
      # tcpdump -nXX -i 
5. Run RDMA traffic between the 2 compute nodes. RDMA packets will
   appear in the tcpdump capture on the monitoring host.

------------------
Known Issues/Notes
------------------

X722:
* Support for the Intel(R) Ethernet Connection X722 iWARP RDMA VF driver
  (i40iwvf) has been discontinued.
* There may be incompatible drivers in the initramfs image. You can
  either update the image or remove those drivers from the initramfs.
  Specifically, look for i40e, ib_addr, ib_cm, ib_core, ib_mad, ib_sa,
  ib_ucm, ib_uverbs, iw_cm, rdma_cm, and rdma_ucm in the output of the
  following command:
     lsinitrd | less
  If you see any of those modules, rebuild the initramfs with the
  following command and include the names of the modules in the "" list.
  For example:
     dracut --force --omit-drivers "i40e ib_addr ib_cm ib_core ib_mad ib_sa ib_ucm ib_uverbs iw_cm rdma_cm rdma_ucm"

E810:
* Linux SR-IOV for RDMA on E810 is currently not supported.
* RDMA is not supported when E810 is configured for more than 4 ports.
* E810 is limited to 4 traffic classes (TCs), one of which may be
  enabled for priority flow control (PFC).
* When using RoCEv2 on Linux kernel version 5.9 or earlier, some iSER
  operations may experience errors related to iSER's handling of work
  requests. To work around this issue, set the E810
  fragment_count_limit devlink parameter to 13. Refer to the "Devlink
  Configuration" section for details on setting devlink parameters.

X722 and E810:
* Some commands (such as "tc qdisc add" and "ethtool -L") cause the ice
  driver to close the associated RDMA interface and reopen it. This
  disrupts RDMA traffic for a few seconds, until the RDMA interface is
  available again.
* NOTE: On RHEL, installing the ice driver currently installs ice into
  the initrd, so the ice driver will be loaded on boot. The installation
  process also installs any currently installed version of irdma into
  the initrd, which might result in an unintended version of irdma being
  installed.
  Depending on the desired boot-time behavior of ice and irdma, follow
  the instructions below to ensure the desired drivers are installed
  correctly.

  A. Both ice and irdma loaded on boot (default):
     1. Follow the installation procedure for the ice driver.
     2. Follow the installation procedure for the irdma driver.
  B. Only the ice driver loaded on boot:
     1. Untar the ice driver.
     2. Follow the installation procedure for the ice driver.
     3. Untar the irdma driver.
     4. Follow the installation procedure for the irdma driver.
     5. % dracut --force --omit-drivers "irdma"
  C. Neither ice nor irdma loaded on boot:
     1. Perform all steps in B.
     2. % dracut --force --omit-drivers "ice irdma"

-------
Support
-------

For general information, go to the Intel support website at:
   http://www.intel.com/support/
or the Intel Wired Networking project hosted by Sourceforge at:
   http://sourceforge.net/projects/e1000
If an issue is identified with the released source code on a supported
kernel with a supported adapter, email the specific information related
to the issue to e1000-rdma@lists.sourceforge.net.

-------
License
-------

This software is available to you under a choice of one of two licenses.
You may choose to be licensed under the terms of the GNU General Public
License (GPL) Version 2, available from the file COPYING in the main
directory of this source tree, or the OpenFabrics.org BSD license below:

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

- Redistributions of source code must retain the above copyright notice,
  this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

----------
Trademarks
----------

Intel is a trademark or registered trademark of Intel Corporation or its
subsidiaries in the United States and/or other countries.
* Other names and brands may be claimed as the property of others.