i40iw Linux* Driver for Intel(R) Ethernet Connection X722
===============================================================================
April 6, 2018
===============================================================================

Contents
--------
- Prerequisites
- Building and Installation
- Testing
- Virtualization
- Interoperability
- RDMA Statistics
- Known Issues

================================================================================

Prerequisites
-------------
- A supported kernel configuration, chosen from the following:
  1) A Linux distribution supported by OFED 3.18-3 or OFED 4.8 (recommended).
     Use OFED if it is required by software you wish to run.
  2) An upstream kernel v4.8-v4.14, if you require fixes not in OFED,
     for example NVMe over Fabrics (NVMeoF).
  3) RHEL 7.4 or SLES 12 SP3 with InfiniBand support installed, if you do not
     want to install OFED or an upstream kernel.
- OFED 3.18-3/OFED 4.8 should be installed with ./install.pl --all
- For OFED 4.8 or Linux kernel v4.8-v4.14, download and install the latest
  rdma-core from https://github.com/linux-rdma/rdma-core/releases

NOTE: Internet Wide Area RDMA Protocol (iWARP) is not supported with the
i40iwvf driver running on Microsoft* Hyper-V.

Building and Installation
-------------------------

OFED 3.18-3
-----------
1. Untar i40iw-<version>.tar.gz, i40iwvf-<version>.tar.gz and
   libi40iw-<version>.tar.gz.
2. Install the PF driver as follows:
       cd i40iw-<version> directory
       ./build.sh <path to i40e driver source> 3
   For example:
       ./build.sh /opt/i40e-2.3.3
3. Install the VF driver as follows:
       cd i40iwvf-<version> directory
       ./build.sh <path to i40evf driver source> 3
   For example:
       ./build.sh /opt/i40evf-3.2.3
4. Install the user-space library as follows:
       cd libi40iw-<version>
       ./build.sh

OFED 4.8
--------
1. Untar i40iw-<version>.tar.gz and i40iwvf-<version>.tar.gz.
2. Install the PF driver as follows:
       cd i40iw-<version> directory
       ./build.sh <path to i40e driver source> 4
   For example:
       ./build.sh /opt/i40e-2.3.3 4
3. Install the VF driver as follows:
       cd i40iwvf-<version> directory
       ./build.sh <path to i40evf driver source> 4
   For example:
       ./build.sh /opt/i40evf-3.2.3 4
4. OFED 4.8 comes with an older version of the rdma-core user-space package.
   Please download the latest from
   https://github.com/linux-rdma/rdma-core/releases and follow its
   installation procedure.

Linux Kernel v4.8-v4.14/RHEL 7.4/SLES 12 SP3
--------------------------------------------
1. Untar i40iw-<version>.tar.gz and i40iwvf-<version>.tar.gz.
2. Install the PF driver as follows:
       cd i40iw-<version> directory
       ./build.sh <path to i40e driver source> k
   For example:
       ./build.sh /opt/i40e-2.3.3 k
3. Install the VF driver as follows:
       cd i40iwvf-<version> directory
       ./build.sh <path to i40evf driver source> k
   For example:
       ./build.sh /opt/i40evf-3.2.3 k
4. Please download the latest rdma-core user-space package from
   https://github.com/linux-rdma/rdma-core/releases and follow its
   installation procedure.

Adapter and Switch Flow Control Setting
---------------------------------------
We recommend enabling link-level flow control (both TX and RX) on the X722 and
the connected switch.

To enable flow control on the X722, use the ethtool -A command. For example:
    ethtool -A p4p1 rx on tx on

Confirm the setting with the ethtool -a command. For example:
    ethtool -a p4p1

You should see this output:
    Pause parameters for p4p1:
    Autonegotiate:  off
    RX:             on
    TX:             on

To enable link-level flow control on the switch, please consult your switch
vendor's documentation. Look for flow-control and make sure both TX and RX are
set. Here is an example for a generic switch to enable both TX and RX flow
control on port 45:
    enable flow-control tx-pause ports 45
    enable flow-control rx-pause ports 45
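
Below is a minimal shell sketch that applies and then verifies these flow
control settings on several ports in one pass. The interface names p4p1 and
p4p2 are examples only; substitute the ports on your system.

    #!/bin/bash
    # Sketch: enable link-level flow control (RX and TX) on each listed
    # X722 port, then print the resulting pause parameters for review.
    # Replace the interface names with your own.
    for ifc in p4p1 p4p2; do
        ethtool -A "$ifc" rx on tx on
        echo "--- $ifc ---"
        ethtool -a "$ifc"
    done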

================================================================================

Virtualization
--------------
To enable SR-IOV support, load i40iw with the following parameters and then
create VFs with i40e. Note: This may have performance and scaling impacts, as
the number of queue pairs and other RDMA resources is decreased.

    resource_profile=2 max_rdma_vfs=<num_vfs>

For example:
    modprobe i40iw resource_profile=2 max_rdma_vfs=32

NOTE: Once the VFs are running, do not change the PF configuration.

Interoperability
----------------
To interoperate with Chelsio iWARP devices with OFED 4.8 or Linux kernels
v4.8-v4.14, load the Chelsio T4/T5 RDMA driver (iw_cxgb4) with the parameter
dack_mode set to 0:

    modprobe iw_cxgb4 dack_mode=0

If iw_cxgb4 is loaded on system boot, create the file
/etc/modprobe.d/iw_cxgb4.conf with the following entry:

    options iw_cxgb4 dack_mode=0

Reload iw_cxgb4 for the new parameters to take effect.

RDMA Statistics
---------------
Use the following command to read RDMA protocol statistics:

    cd /sys/class/infiniband/i40iw0/proto_stats; for f in *; do echo -n "$f: "; cat "$f"; done; cd

The following counters will increment when RDMA applications are transferring
data over the network:
- ipInReceives
- tcpInSegs
- tcpOutSegs
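
As a quick check, the sketch below samples these three counters before and
after a workload and prints the deltas; non-zero deltas confirm that traffic
is flowing through the i40iw device. The device name i40iw0 matches the
command above, and the sleep is only a stand-in for your actual RDMA test.

    #!/bin/bash
    # Sketch: snapshot the i40iw protocol counters, run (or wait for) an
    # RDMA workload, then report how much each counter increased.
    STATS=/sys/class/infiniband/i40iw0/proto_stats
    declare -A before
    for c in ipInReceives tcpInSegs tcpOutSegs; do
        before[$c]=$(cat "$STATS/$c")
    done
    sleep 10    # replace with your RDMA workload or benchmark
    for c in ipInReceives tcpInSegs tcpOutSegs; do
        after=$(cat "$STATS/$c")
        echo "$c: +$((after - before[$c]))"
    done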

Memory Requirements
-------------------
A default i40iw load requires a minimum of 6 GB of memory for initialization.
For applications where the amount of memory is constrained, you can decrease
the required memory by lowering the resources available to the i40iw driver.
To do this, load the driver with the following profile setting. Note: This can
have performance and scaling impacts, as the number of queue pairs and other
RDMA resources is decreased in order to lower memory usage to approximately
1.2 GB.

    modprobe i40iw resource_profile=2

Scaling Limits
--------------
Intel(R) Ethernet Connection X722 has limited RDMA resources, including the
number of Queue Pairs (QPs), Completion Queues (CQs) and Memory Regions (MRs).
In highly scaled environments or highly interconnected HPC-style applications
such as all-to-all, users may experience QP failure errors once they reach the
RDMA resource limits.

Below are the per-physical-port limits for 4-port devices for the three
resources associated with the default i40iw driver load:
    QPs: 16384
    CQs: 32768
    MRs: 2453503

Other resource profiles allocate resources differently. If i40iw is loaded
with resource_profile=2, resources will be more limited. The example below
shows the per-physical-port limits when you use
"modprobe i40iw resource_profile=2". (Note that these may increase if you load
fewer than 32 VFs using the max_rdma_vfs module parameter.)
    QPs: 2048
    CQs: 3584
    MRs: 6143

Flow Control Recommendation
---------------------------
For better performance, enable flow control on all the nodes and on the switch
they are connected to. To enable flow control on a node, run:

    ethtool -A <interface> rx on tx on

========================================
Recommended Settings for Intel MPI 2017.0.x
========================================
Note: The following instructions assume that Intel MPI is installed in the
default locations. Refer to the Intel MPI documentation for further details on
parameters and general instructions.

1. Add or modify the following line in /etc/dat.conf, changing
   <interface name> to match your interface name:

       ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "<interface name> 0" ""

2. To select the iWARP device, add the following to the mpiexec command:

       -genv I_MPI_FALLBACK_DEVICE disable
       -genv I_MPI_DEVICE rdma:ofa-v2-iwarp

Example mpiexec command line for uDAPL-2.0:

    mpiexec -machinefile mpd.hosts_impi -genv I_MPI_FALLBACK_DEVICE disable
    -genv I_MPI_DEVICE rdma:ofa-v2-iwarp -ppn <processes per node>
    -n <number of processes> <benchmark> [optional_parameters]

Note: mpd.hosts_impi is a text file with a list of the nodes' qualified
hostnames or IP addresses, one per line, in the MPI ring.

Note: Recommended optional_parameters if running the IMB-MPI1 benchmark:
    -time 1000000 (specifies that a benchmark will run at most that many
                   seconds per message size)
    -mem 2GB (specifies that at most that many GBytes are allocated per
              process for the message buffers)

========================================
Recommended Settings for Open MPI 3.x.x
========================================
Note: The following instructions assume that Open MPI is installed in the
default locations. Refer to the Open MPI documentation at open-mpi.org for
further details on parameters and general instructions.

Note: There is more than one way to specify MCA parameters in Open MPI. Please
visit this link and use the best method for your environment:
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

Necessary parameters to the mpirun command:

    -mca btl openib,self,vader
        Use openib (OpenFabrics device), send-to-self semantics and shared
        memory.

    -mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128
        Set the receive queue sizes. This is especially useful for interop
        between iWARP RDMA vendors, because the queue sizes could be different
        per vendor in the file "openmpi/mca-btl-openib-device-params.ini".

    -mca oob ^ud
        Do not use UD QPs.

Example mpirun command line:

    mpirun -np <number of processes> -hostfile mpd.hosts_ompi --map-by node
    --allow-run-as-root --display-map -v -tag-output
    -mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128
    -mca btl openib,self,vader -mca btl_mpi_leave_pinned 0 -mca oob ^ud
    /openmpi_benchmarks/3.x.x/benchmark [optional_parameters]

Note: mpd.hosts_ompi is a text file with a list of the nodes' qualified
hostnames or IP addresses and "slots=<n>", one per line, in the MPI ring. The
slots parameter is required for -np greater than 72. Refer to the Open MPI
documentation for more details. Note: underscores are not allowed in
hostnames.
Example:
    QA0094-1-0 slots=72
    QA0096-1-0 slots=72

Recommended optional_parameters for the IMB-MPI1 benchmark:
    -time 1000000 (specifies that a benchmark will run at most that many
                   seconds per message size)

================================================================================

Known Issues/Troubleshooting
----------------------------
* You may experience a kernel crash using OFED 3.18-3 under heavy load. This
  is fixed in the upstream kernel with commit dafb558717.

Incompatible Drivers in initramfs
---------------------------------
There may be incompatible drivers in the initramfs image. You can either
update the image or remove the drivers from initramfs.

Specifically, look for i40e, ib_addr, ib_cm, ib_core, ib_mad, ib_sa, ib_ucm,
ib_uverbs, iw_cm, rdma_cm, rdma_ucm in the output of the following command:

    lsinitrd | less

If you see any of those modules, rebuild initramfs with the following command
and include the names of the modules in the "--omit-drivers" list. Below is an
example:

    dracut --force --omit-drivers "i40e ib_addr ib_cm ib_core ib_mad ib_sa
    ib_ucm ib_uverbs iw_cm rdma_cm rdma_ucm"
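
The sketch below automates that check: it scans the current initramfs for the
modules listed above and, if any are present, rebuilds the image with only
those modules omitted. It assumes dracut manages your initramfs (as on RHEL
and SLES); review the resulting --omit-drivers list before running it on a
production system.

    #!/bin/bash
    # Sketch: find incompatible modules in the current initramfs and rebuild
    # the image with them omitted.
    MODULES="i40e ib_addr ib_cm ib_core ib_mad ib_sa ib_ucm ib_uverbs iw_cm rdma_cm rdma_ucm"
    contents=$(lsinitrd)   # contents of the running kernel's initramfs image
    found=""
    for m in $MODULES; do
        if echo "$contents" | grep -q "/${m}.ko"; then
            found="$found $m"
        fi
    done
    if [ -n "$found" ]; then
        echo "Rebuilding initramfs, omitting:$found"
        dracut --force --omit-drivers "${found# }"
    else
        echo "No incompatible drivers found in initramfs."
    fi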

================================================================================

Support
-------
For general information, go to the Intel support website at:
    http://www.intel.com/support/
or the Intel Wired Networking project hosted by Sourceforge at:
    http://sourceforge.net/projects/e1000

If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
to e1000-rdma@lists.sourceforge.net

================================================================================

License
-------
This software is available to you under a choice of one of two licenses. You
may choose to be licensed under the terms of the GNU General Public License
(GPL) Version 2, available from the file COPYING in the main directory of this
source tree, or the OpenFabrics.org BSD license below:

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

- Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

================================================================================

Trademarks
----------
Intel and Itanium are trademarks or registered trademarks of Intel Corporation
or its subsidiaries in the United States and/or other countries.

* Other names and brands may be claimed as the property of others.