Sunday, March 28, 2010

Understanding Jumbo Frames !!

Whether or not Gigabit Ethernet (and beyond) should support frame sizes (i.e. packets) larger than 1500 bytes has been a topic of great debate. With the explosive growth of Gigabit ethernet, the impact of this decision is critically important and will affect Internet performance for years to come.

Most of the debate about jumbo frames has focused on local area network performance and the impact that frame size has on host processing requirements, interface cards, memory, etc. But what is less well known, and of critical concern for high performance computing, is the impact that frame size has on wide area network performance. This document discusses why you should care, and examines the largely ignored but important impact that frame size has on the wide area performance of TCP.

How jumbo is a jumbo frame anyway?

Ethernet has used 1500 byte frame sizes since it was created (around 1980). To maintain backward compatibility, 100 Mbps ethernet used the same size, and today "standard" gigabit ethernet is also using 1500 byte frames. This is so a packet to/from any combination of 10/100/1000 Mbps ethernet devices can be handled without any layer two fragmentation or reassembly.

"Jumbo frames" extend the ethernet frame size to 9000 bytes. Why 9000? First, because ethernet uses a 32 bit CRC that loses its effectiveness above about 12000 bytes. Second, because 9000 bytes is large enough to carry an 8 KB application datagram (e.g. NFS) plus packet header overhead. Is 9000 bytes enough? It's a lot better than 1500, but for pure performance reasons there is little reason to stop there. At 64 KB we reach the limit of an IPv4 datagram, while IPv6 jumbograms allow packets up to 4 GB in size. For ethernet, however, the 32 bit CRC limit is hard to change, so don't expect to see ethernet frame sizes above 9000 bytes anytime soon.

How can jumbo frames and 1500 byte frames coexist?

Two basic approaches exist:

* On a port by port basis, where everything "downstream" from a given port is known to support jumbo frames.
* Using 802.1q Virtual LANs, where jumbo frame and non-jumbo frame devices are segregated into different VLANs (a minimal Linux sketch of this approach follows below).
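For example (a rough sketch only; the interface name eth0 and the VLAN IDs are made up for the illustration, and the NIC, switch ports and every device on the jumbo VLAN must support the larger MTU), the physical port and the jumbo VLAN carry a 9000 byte MTU while a second VLAN for legacy 1500 byte devices stays at the standard size:

#ip link set dev eth0 mtu 9000
#ip link add link eth0 name eth0.100 type vlan id 100
#ip link set dev eth0.100 mtu 9000
#ip link add link eth0 name eth0.200 type vlan id 200
#ip link set dev eth0.200 mtu 1500
#ping -M do -s 8972 10.1.1.1

The last command (10.1.1.1 stands in for any remote host on the jumbo VLAN) sends an unfragmentable 8972 byte ICMP payload (9000 minus 20 bytes of IP header and 8 bytes of ICMP header); if it gets a reply, the 9000 byte path really works end to end.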

Are jumbo frames bad for multimedia?

For applications that are sensitive to burst drops, delay jitter, etc., it can be argued that large frames are a bad idea. No application has to use large frames, however, so the question is really whether other applications' large frames will negatively impact your application's small ones. This is primarily an issue of slot time, i.e. how much a large packet will delay (or quantize) the times available to transmit the small packets.

A 9000 byte GigE packet takes the same amount of time to transmit as a 900 byte fast ethernet packet or a 90 byte 10 Mbps ethernet packet (about 72 microseconds of serialization time in each case). So jumbo frames on gigabit ethernet at worst add less delay variation than 1500 byte frames do on slower ethernets. And no one is suggesting that slower ethernets use 9000 byte frames. As for queueing delay concerns, those can arise whether packets are large or small. If delivery QoS is required, then the routers need to implement some kind of priority or expedited forwarding, regardless of the packet sizes. Tiny frames (including 53 byte ATM cells) may be helpful when multiplexing lower bit rate streams, but they become increasingly ridiculous on gigabit and beyond links.

Understanding Iperf

Iperf is a tool to measure the bandwidth and the quality of a network link. Jperf can be associated with Iperf to provide a graphical frontend written in Java.

The network link is delimited by two hosts running Iperf.

The quality of a link can be tested as follows:

- Latency (response time or RTT): can be measured with the ping command.
- Jitter (latency variation): can be measured with an Iperf UDP test.
- Datagram loss: can be measured with an Iperf UDP test.

The bandwidth is measured through TCP tests.
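For example, the latency toward the Iperf server can be checked with a plain ping before running the Iperf tests (10.1.1.1 is the server address used throughout this tutorial):

#ping -c 5 10.1.1.1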

To be clear, the difference between TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) is that TCP checks that the packets are correctly delivered to the receiver, whereas UDP sends the packets without any such checks, with the advantage of being quicker than TCP.
Iperf uses the different capacities of TCP and UDP to provide statistics about network links.

Finally, Iperf can be installed very easily on any UNIX/Linux or Microsoft Windows system. One host must be set as client, the other one as server.
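For example, on most Linux distributions Iperf is available as a package (a sketch only; package names and repositories can vary by distribution and version):

#apt-get install iperf       (Debian, Ubuntu)
#yum install iperf           (Red Hat / CentOS, possibly from a third-party repository such as EPEL)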


Here is a diagram where Iperf is installed on a Linux machine and a Microsoft Windows machine.
Linux is used as the Iperf client and Windows as the Iperf server. Of course, it is also possible to use two Linux boxes.


Iperf tests:

no arg.       Default settings
-f            Data format
-r            Bi-directional bandwidth
-d            Simultaneous bi-directional bandwidth
-w            TCP window size
-p, -t, -i    Port, timing and interval
-u, -b        UDP tests, bandwidth settings
-m            Maximum Segment Size display
-M            Maximum Segment Size settings
-P            Parallel tests
-h            help

Jperf:

no arg.       Default settings
-d            Simultaneous bi-directional bandwidth
-u, -b        UDP tests, bandwidth settings


Default Iperf settings:
Also check the "Jperf" section.

By default, the Iperf client connects to the Iperf server on the TCP port 5001 and the bandwidth displayed by Iperf is the bandwidth from the client to the server.
If you want to use UDP tests, use the -u argument.
The -d and -r Iperf client arguments measure the bi-directional bandwidths. (See further on in this tutorial.)

Client side:

#iperf -c 10.1.1.1
------------------------------------------------------------
Client connecting to 10.1.1.1, TCP port 5001
TCP window size: 16384 Byte (default)
------------------------------------------------------------
[ 3] local 10.6.2.5 port 33453 connected with 10.1.1.1 port 5001
[ 3] 0.0-10.2 sec 1.26 MBytes 1.05 Mbits/sec

Server side:

#iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[852] local 10.1.1.1 port 5001 connected with 10.6.2.5 port 33453
[ ID] Interval Transfer Bandwidth
[852] 0.0-10.6 sec 1.26 MBytes 1.03 Mbits/sec


Data formatting: (-f argument)

The -f argument can display the results in the desired format: bits(b), bytes(B), kilobits(k), kilobytes(K), megabits(m), megabytes(M), gigabits(g) or gigabytes(G).
Generally the bandwidth measures are displayed in bits (or Kilobits, etc ...) and an amount of data is displayed in bytes (or Kilobytes, etc ...).
As a reminder, 1 byte is equal to 8 bits and, in the computer science world, 1 kilo is equal to 1024 (2^10).
For example: 100'000'000 bytes is not equal to 100 Mbytes but to 100'000'000/1024/1024 = 95.37 Mbytes.
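This conversion can be checked quickly from the shell, for example with bc (the scale value just sets the number of decimal places):

#echo "scale=4; 100000000/1024/1024" | bc
95.3674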

Client side:

#iperf -c 10.1.1.1 -f b
------------------------------------------------------------
Client connecting to 10.1.1.1, TCP port 5001
TCP window size: 16384 Byte (default)
------------------------------------------------------------
[ 3] local 10.6.2.5 port 54953 connected with 10.1.1.1 port 5001
[ 3] 0.0-10.2 sec 1359872 Bytes 1064272 bits/sec

Server side:

#iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[852] local 10.1.1.1 port 5001 connected with 10.6.2.5 port 33453
[ ID] Interval Transfer Bandwidth
[852] 0.0-10.6 sec 920 KBytes 711 Kbits/sec



Bi-directional bandwidth measurement: (-r argument)

With the -r argument, the Iperf server connects back to the client once the first test is done, so that the bandwidth is measured in each direction, one after the other. By default, only the bandwidth from the client to the server is measured.
If you want to measure the bi-directional bandwidth simultaneously, use the -d argument instead. (See the next test.)

Client side:

#iperf -c 10.1.1.1 -r
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 10.1.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 5] local 10.6.2.5 port 35726 connected with 10.1.1.1 port 5001
[ 5] 0.0-10.0 sec 1.12 MBytes 936 Kbits/sec
[ 4] local 10.6.2.5 port 5001 connected with 10.1.1.1 port 1640
[ 4] 0.0-10.1 sec 74.2 MBytes 61.7 Mbits/sec

Server side:

#iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[852] local 10.1.1.1 port 5001 connected with 10.6.2.5 port 54355
[ ID] Interval Transfer Bandwidth
[852] 0.0-10.1 sec 1.15 MBytes 956 Kbits/sec
------------------------------------------------------------
Client connecting to 10.6.2.5, TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[824] local 10.1.1.1 port 1646 connected with 10.6.2.5 port 5001
[ ID] Interval Transfer Bandwidth
[824] 0.0-10.0 sec 73.3 MBytes 61.4 Mbits/sec



Simultaneous bi-directional bandwidth measurement: (-d argument)
Also check the "Jperf" section.

To measure the bi-directional bandwidths simultaneously, use the -d argument. If you want to test the two directions sequentially, use the -r argument (see the previous test).
By default (i.e. without the -r or -d arguments), only the bandwidth from the client to the server is measured.

Client side:

#iperf -c 10.1.1.1 -d
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 10.1.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 5] local 10.6.2.5 port 60270 connected with 10.1.1.1 port 5001
[ 4] local 10.6.2.5 port 5001 connected with 10.1.1.1 port 2643
[ 4] 0.0-10.0 sec 76.3 MBytes 63.9 Mbits/sec
[ 5] 0.0-10.1 sec 1.55 MBytes 1.29 Mbits/sec

Server side:

#iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[852] local 10.1.1.1 port 5001 connected with 10.6.2.5 port 60270
------------------------------------------------------------
Client connecting to 10.6.2.5, TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[800] local 10.1.1.1 port 2643 connected with 10.6.2.5 port 5001
[ ID] Interval Transfer Bandwidth
[800] 0.0-10.0 sec 76.3 MBytes 63.9 Mbits/sec
[852] 0.0-10.1 sec 1.55 MBytes 1.29 Mbits/sec



TCP Window size: (-w argument)

The TCP window size is the amount of data that can be buffered during a connection without an acknowledgment from the receiver.
The window field in the TCP header is 16 bits, so without window scaling the window is limited to 65,535 bytes.
On Linux systems, when specifying a TCP buffer size with the -w argument, the kernel allocates twice the requested size.
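As a side note not covered by the examples below: the window size a path can actually use is roughly its bandwidth-delay product. For instance (the figures are just an illustration), a 100 Mbit/s path with a 20 ms round-trip time needs about 100,000,000 / 8 x 0.020 = 250 KBytes of window, which could be requested on both sides like this:

#iperf -c 10.1.1.1 -w 250K
#iperf -s -w 250K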

Client side:

#iperf -c 10.1.1.1 -w 2000
WARNING: TCP window size set to 2000 bytes. A small window size
will give poor performance. See the Iperf documentation.
------------------------------------------------------------
Client connecting to 10.1.1.1, TCP port 5001
TCP window size: 3.91 KByte (WARNING: requested 1.95 KByte)
------------------------------------------------------------
[ 3] local 10.6.2.5 port 51400 connected with 10.1.1.1 port 5001
[ 3] 0.0-10.1 sec 704 KBytes 572 Kbits/sec

Server side:

#iperf -s -w 4000
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 3.91 KByte
------------------------------------------------------------
[852] local 10.1.1.1 port 5001 connected with 10.6.2.5 port 51400
[ ID] Interval Transfer Bandwidth
[852] 0.0-10.1 sec 704 KBytes 570 Kbits/sec



Communication port (-p), timing (-t) and interval (-i):

The Iperf server communication port can be changed with the -p argument. It must be set to the same value on the client and on the server; the default is TCP port 5001.
The -t argument specifies the test duration time in seconds, default is 10 secs.
The -i argument indicates the interval in seconds between periodic bandwidth reports.

Client side:

#iperf -c 10.1.1.1 -p 12000 -t 20 -i 2
------------------------------------------------------------
Client connecting to 10.1.1.1, TCP port 12000
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.6.2.5 port 58316 connected with 10.1.1.1 port 12000
[ 3] 0.0- 2.0 sec 224 KBytes 918 Kbits/sec
[ 3] 2.0- 4.0 sec 368 KBytes 1.51 Mbits/sec
[ 3] 4.0- 6.0 sec 704 KBytes 2.88 Mbits/sec
[ 3] 6.0- 8.0 sec 280 KBytes 1.15 Mbits/sec
[ 3] 8.0-10.0 sec 208 KBytes 852 Kbits/sec
[ 3] 10.0-12.0 sec 344 KBytes 1.41 Mbits/sec
[ 3] 12.0-14.0 sec 208 KBytes 852 Kbits/sec
[ 3] 14.0-16.0 sec 232 KBytes 950 Kbits/sec
[ 3] 16.0-18.0 sec 232 KBytes 950 Kbits/sec
[ 3] 18.0-20.0 sec 264 KBytes 1.08 Mbits/sec
[ 3] 0.0-20.1 sec 3.00 MBytes 1.25 Mbits/sec

Server side:

#iperf -s -p 12000
------------------------------------------------------------
Server listening on TCP port 12000
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[852] local 10.1.1.1 port 12000 connected with 10.6.2.5 port 58316
[ ID] Interval Transfer Bandwidth
[852] 0.0-20.1 sec 3.00 MBytes 1.25 Mbits/sec



UDP tests: (-u), bandwidth settings (-b)
Also check the "Jperf" section.

The UDP tests with the -u argument will give invaluable information about the jitter and the packet loss. If you don't specify the -u argument, Iperf uses TCP.
To keep a good link quality, the packet loss should not go over 1 %. A high packet loss rate will generate a lot of TCP segment retransmissions which will affect the bandwidth.

The jitter is basically the latency variation and does not depend on the latency. You can have high response times and a very low jitter. The jitter value is particularly important on network links supporting voice over IP (VoIP) because a high jitter can break a call.
The -b argument lets you specify the target bandwidth of the UDP test.

Client side:

#iperf -c 10.1.1.1 -u -b 10m
------------------------------------------------------------
Client connecting to 10.1.1.1, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 108 KByte (default)
------------------------------------------------------------
[ 3] local 10.6.2.5 port 32781 connected with 10.1.1.1 port 5001
[ 3] 0.0-10.0 sec 11.8 MBytes 9.89 Mbits/sec
[ 3] Sent 8409 datagrams
[ 3] Server Report:
[ 3] 0.0-10.0 sec 11.8 MBytes 9.86 Mbits/sec 2.617 ms 9/ 8409 (0.11%)

Server side:

#iperf -s -u -i 1
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 8.00 KByte (default)
------------------------------------------------------------
[904] local 10.1.1.1 port 5001 connected with 10.6.2.5 port 32781
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[904] 0.0- 1.0 sec 1.17 MBytes 9.84 Mbits/sec 1.830 ms 0/ 837 (0%)
[904] 1.0- 2.0 sec 1.18 MBytes 9.94 Mbits/sec 1.846 ms 5/ 850 (0.59%)
[904] 2.0- 3.0 sec 1.19 MBytes 9.98 Mbits/sec 1.802 ms 2/ 851 (0.24%)
[904] 3.0- 4.0 sec 1.19 MBytes 10.0 Mbits/sec 1.830 ms 0/ 850 (0%)
[904] 4.0- 5.0 sec 1.19 MBytes 9.98 Mbits/sec 1.846 ms 1/ 850 (0.12%)
[904] 5.0- 6.0 sec 1.19 MBytes 10.0 Mbits/sec 1.806 ms 0/ 851 (0%)
[904] 6.0- 7.0 sec 1.06 MBytes 8.87 Mbits/sec 1.803 ms 1/ 755 (0.13%)
[904] 7.0- 8.0 sec 1.19 MBytes 10.0 Mbits/sec 1.831 ms 0/ 850 (0%)
[904] 8.0- 9.0 sec 1.19 MBytes 10.0 Mbits/sec 1.841 ms 0/ 850 (0%)
[904] 9.0-10.0 sec 1.19 MBytes 10.0 Mbits/sec 1.801 ms 0/ 851 (0%)
[904] 0.0-10.0 sec 11.8 MBytes 9.86 Mbits/sec 2.618 ms 9/ 8409 (0.11%)



Maximum Segment Size (-m argument) display:

The Maximum Segment Size (MSS) is the largest amount of data, in bytes, that a computer can support in a single, unfragmented TCP segment.
It can be calculated as follows:
MSS = MTU - TCP & IP headers
The TCP & IP headers are equal to 40 bytes.
The MTU or Maximum Transmission Unit is the greatest amount of data that can be transferred in a frame.
Here are some default MTU sizes for different network technologies:
Ethernet - 1500 bytes: used in a LAN.
PPPoE - 1492 bytes: used on ADSL links.
Token Ring (16Mb/sec) - 17914 bytes: old technology developed by IBM.
Dial-up - 576 bytes

Generally, a higher MTU (and MSS) brings higher bandwidth efficiency.
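As a rough illustration of that efficiency gain (assuming 40 bytes of TCP/IP headers, no TCP options, and about 38 bytes of Ethernet framing, preamble and inter-frame gap per packet): a 1500 byte MTU carries 1460 bytes of data out of roughly 1538 bytes on the wire, while a 9000 byte jumbo frame carries 8960 out of roughly 9038 bytes.

#echo "scale=3; 1460/1538" | bc
.949
#echo "scale=3; 8960/9038" | bc
.991

That is roughly 95% versus 99% of the link capacity spent on useful data.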

Client side:

#iperf -c 10.1.1.1 -m
------------------------------------------------------------
Client connecting to 10.1.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.6.2.5 port 41532 connected with 10.1.1.1 port 5001
[ 3] 0.0-10.2 sec 1.27 MBytes 1.04 Mbits/sec
[ 3] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

Here the MSS is not 1500 - 40 but 1500 - 40 - 12 = 1448 bytes, because the TCP Timestamps option adds 12 bytes to the TCP header.

Server side:

#iperf -s


Maximum Segment Size (-M argument) settings:

Use the -M argument to change the MSS. (See the previous test for more explanations about the MSS)

#iperf -c 10.1.1.1 -M 1300 -m
WARNING: attempt to set TCP maximum segment size to 1300, but got 536
------------------------------------------------------------
Client connecting to 10.1.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.6.2.5 port 41533 connected with 10.1.1.1 port 5001
[ 3] 0.0-10.1 sec 4.29 MBytes 3.58 Mbits/sec
[ 3] MSS size 1288 bytes (MTU 1328 bytes, unknown interface)

Server side:

#iperf -s


Parallel tests (-P argument):

Use the -P argument to run parallel tests.

Client side:

#iperf -c 10.1.1.1 -P 2
------------------------------------------------------------
Client connecting to 10.1.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.6.2.5 port 41534 connected with 10.1.1.1 port 5001
[ 4] local 10.6.2.5 port 41535 connected with 10.1.1.1 port 5001
[ 4] 0.0-10.1 sec 1.35 MBytes 1.12 Mbits/sec
[ 3] 0.0-10.1 sec 1.35 MBytes 1.12 Mbits/sec
[SUM] 0.0-10.1 sec 2.70 MBytes 2.24 Mbits/sec

Server side:

#iperf -s


Iperf help:

#iperf -h
Usage: iperf [-s|-c host] [options]
       iperf [-h|--help] [-v|--version]

Client/Server:
  -f, --format    [kmKM]   format to report: Kbits, Mbits, KBytes, MBytes
  -i, --interval  #        seconds between periodic bandwidth reports
  -l, --len       #[KM]    length of buffer to read or write (default 8 KB)
  -m, --print_mss          print TCP maximum segment size (MTU - TCP/IP header)
  -p, --port      #        server port to listen on/connect to
  -u, --udp                use UDP rather than TCP
  -w, --window    #[KM]    TCP window size (socket buffer size)
  -B, --bind      "host"   bind to "host", an interface or multicast address
  -C, --compatibility      for use with older versions does not sent extra msgs
  -M, --mss       #        set TCP maximum segment size (MTU - 40 bytes)
  -N, --nodelay            set TCP no delay, disabling Nagle's Algorithm
  -V, --IPv6Version        Set the domain to IPv6

Server specific:
  -s, --server             run in server mode
  -U, --single_udp         run in single threaded UDP mode
  -D, --daemon             run the server as a daemon

Client specific:
  -b, --bandwidth #[KM]    for UDP, bandwidth to send at in bits/sec
                           (default 1 Mbit/sec, implies -u)
  -c, --client    "host"   run in client mode, connecting to "host"
  -d, --dualtest           Do a bidirectional test simultaneously
  -n, --num       #[KM]    number of bytes to transmit (instead of -t)
  -r, --tradeoff           Do a bidirectional test individually
  -t, --time      #        time in seconds to transmit for (default 10 secs)
  -F, --fileinput "name"   input the data to be transmitted from a file
  -I, --stdin              input the data to be transmitted from stdin
  -L, --listenport #       port to recieve bidirectional tests back on
  -P, --parallel  #        number of parallel client threads to run
  -T, --ttl       #        time-to-live, for multicast (default 1)

Miscellaneous:
  -h, --help               print this message and quit
  -v, --version            print version information and quit


vmxnet3: A New Para-Virtualized NIC from VMware

VMXNET3, the newest generation of virtual network adapter from VMware, offers performance on par with or better than its previous generations in both Windows and Linux guests. Both the driver and the device have been highly tuned to perform better on modern systems. Furthermore, VMXNET3 introduces new features and enhancements, such as TSO6 and RSS.

TSO6 makes it especially useful for users deploying applications that deal with IPv6 traffic, while RSS is helpful for deployments requiring high scalability. All these features give VMXNET3 advantages that are not possible with previous generations of virtual network adapters.
Moving forward, to keep pace with an ever-increasing demand for network bandwidth, VMware recommends that customers migrate to VMXNET3 if performance is a top concern for their deployments.

The VMXNET3 driver is NAPI-compliant on Linux guests. NAPI is an interrupt mitigation mechanism that improves high-speed networking performance on Linux by switching back and forth between interrupt mode and polling mode during packet receive. It is a proven technique to improve CPU efficiency and allows the guest to process higher packet loads. VMXNET3 also supports Large Receive Offload (LRO) on Linux guests. However, in ESX 4.0 the VMkernel backend supports large receive packets only if the packets originate from another virtual machine running on the same host.

VMXNET3 supports larger Tx/Rx ring buffer sizes compared to previous generations of virtual network devices. This feature benefits certain network workloads with bursty and high‐peak throughput. Having a larger ring size provides extra buffering to better cope with transient packet bursts.
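On a Linux guest these characteristics can be inspected with ethtool (a sketch only; eth0 is an assumed interface name, and the ring sizes actually supported depend on the device and driver version):

#ethtool -i eth0                       (driver name and version, expected to show vmxnet3)
#ethtool -g eth0                       (current and maximum Rx/Tx ring sizes)
#ethtool -G eth0 rx 4096 tx 4096       (request larger rings, within the reported maximums)
#ethtool -k eth0                       (offload settings such as TSO and LRO)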

VMXNET3 supports three interrupt modes:

MSI‐X,
MSI, and
INTx.

Normally the VMXNET3 guest driver will attempt to use the interrupt modes in the order given above, if the guest kernel supports them. With VMXNET3, TCP Segmentation Offload (TSO) for IPv6 is supported for both Windows and Linux guests now, and TSO support for IPv4 is added for Solaris guests in addition to Windows and Linux guests.

To use VMXNET3, the user must install VMware Tools on a virtual machine with hardware version 7.