1 .\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
2 .\" All rights reserved.
4 .\" Redistribution and use in source and binary forms, with or without
5 .\" modification, are permitted provided that the following conditions
7 .\" 1. Redistributions of source code must retain the above copyright
8 .\" notice, this list of conditions and the following disclaimer.
9 .\" 2. Redistributions in binary form must reproduce the above copyright
10 .\" notice, this list of conditions and the following disclaimer in the
11 .\" documentation and/or other materials provided with the distribution.
13 .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
14 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
15 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
16 .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
17 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
18 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
19 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
20 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
21 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
22 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
25 .\" This document is derived in part from the enet man page (enet.4)
26 .\" distributed with 4.3BSD Unix.
35 .Nd a framework for fast packet I/O
40 is a framework for extremely fast and efficient packet I/O
41 for userspace and kernel clients, and for Virtual Machines.
44 Linux and some versions of Windows, and supports a variety of
48 .It Nm physical NIC ports
49 to access individual queues of network interfaces;
51 to inject packets into the host stack;
53 implementing a very fast and modular in-kernel software switch/dataplane;
55 a shared memory packet transport channel;
56 .It Nm netmap monitors
57 a mechanism similar to
64 are accessed interchangeably with the same API,
65 and are at least one order of magnitude faster than
66 standard OS mechanisms
67 (sockets, bpf, tun/tap interfaces, native switches, pipes).
68 With suitably fast hardware (NICs, PCIe buses, CPUs),
72 reaches 14.88 million packets per second (Mpps)
73 with much less than one core on 10 Gbit/s NICs;
74 35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
75 about 20 Mpps per core for VALE ports;
80 support can still use the API in emulated mode,
81 which uses unmodified device drivers and is 3-5 times faster than
85 Userspace clients can dynamically switch NICs into
87 mode and send and receive raw packets through
88 memory mapped buffers.
91 switch instances and ports,
95 can be created dynamically,
96 providing high speed packet I/O between processes,
97 virtual machines, NICs and the host stack.
100 supports both non-blocking I/O through
102 synchronization and blocking I/O through a file descriptor
103 and standard OS mechanisms such as
113 are implemented by a single kernel module, which also emulates the
115 API over standard drivers.
116 For best performance,
118 requires native support in device drivers.
119 A list of such devices is at the end of this document.
121 In the rest of this (long) manual page we document
122 various aspects of the
126 architecture, features and usage.
129 supports raw packet I/O through a
131 which can be connected to a physical interface
137 Ports use preallocated circular queues of buffers
139 residing in an mmapped region.
140 There is one ring for each transmit/receive queue of a
142 An additional ring pair connects to the host stack.
144 After binding a file descriptor to a port, a
146 client can send or receive packets in batches through
147 the rings, and possibly implement zero-copy forwarding
150 All NICs operating in
152 mode use the same memory region,
accessible to all processes that own
155 file descriptors bound to NICs.
161 by default use separate memory regions,
162 but can be independently configured to share memory.
163 .Sh ENTERING AND EXITING NETMAP MODE
164 The following section describes the system calls to create
172 Simpler, higher level functions are described in the
176 Ports and rings are created and controlled through a file descriptor,
177 created by opening a special device
.Dl fd = open("/dev/netmap", O_RDWR);
179 and then bound to a specific port with an
180 .Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
183 has multiple modes of operation controlled by the
187 specifies the netmap port name, as follows:
189 .It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
190 the data path of the NIC is disconnected from the host stack,
191 and the file descriptor is bound to the NIC (one or all queues),
192 or to the host stack;
194 the file descriptor is bound to port PPP of VALE switch SSS.
195 Switch instances and ports are dynamically created if necessary.
197 Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
198 cannot exceed IFNAMSIZ characters, and PPP cannot
199 be the name of any existing OS network interface.
204 indicates the size of the shared memory region,
205 and the number, size and location of all the
207 data structures, which can be accessed by mmapping the memory
.Dl char *mem = mmap(NULL, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
210 Non-blocking I/O is done with special
215 on the file descriptor permit blocking I/O.
219 mode, the OS will still believe the interface is up and running.
OS-generated packets for that NIC end up in a
222 ring, and another ring is used to send packets into the OS network stack.
225 on the file descriptor removes the binding,
226 and returns the NIC to normal mode (reconnecting the data path
227 to the host stack), or destroys the virtual port.
229 The data structures in the mmapped memory region are detailed in
230 .In sys/net/netmap.h ,
231 which is the ultimate reference for the
234 The main structures and fields are indicated below:
236 .It Dv struct netmap_if (one per interface )
240 const uint32_t ni_flags; /* properties */
242 const uint32_t ni_tx_rings; /* NIC tx rings */
243 const uint32_t ni_rx_rings; /* NIC rx rings */
244 uint32_t ni_bufs_head; /* head of extra bufs list */
249 Indicates the number of available rings
250 .Pa ( struct netmap_rings )
251 and their position in the mmapped region.
252 The number of tx and rx rings
253 .Pa ( ni_tx_rings , ni_rx_rings )
254 normally depends on the hardware.
255 NICs also have an extra tx/rx ring pair connected to the host stack.
257 can also request additional unbound buffers in the same memory space,
258 to be used as temporary storage for packets.
260 buffers is specified in the
263 On success, the kernel writes back to
265 the number of extra buffers actually allocated (they may be less
266 than the amount requested if the memory space ran out of buffers).
268 contains the index of the first of these extra buffers,
269 which are connected in a list (the first uint32_t of each
270 buffer being the index of the next buffer in the list).
273 indicates the end of the list.
274 The application is free to modify
275 this list and use the buffers (i.e., binding them to the slots of a
277 When closing the netmap file descriptor,
the kernel frees the buffers contained in the list pointed to by
, irrespective of the buffers originally provided by the kernel on
282 .It Dv struct netmap_ring (one per ring )
286 const uint32_t num_slots; /* slots in each ring */
287 const uint32_t nr_buf_size; /* size of each buffer */
289 uint32_t head; /* (u) first buf owned by user */
290 uint32_t cur; /* (u) wakeup position */
291 const uint32_t tail; /* (k) first buf owned by kernel */
294 struct timeval ts; /* (k) time of last rxsync() */
296 struct netmap_slot slot[0]; /* array of slots */
300 Implements transmit and receive rings, with read/write
301 pointers, metadata and an array of
303 describing the buffers.
304 .It Dv struct netmap_slot (one per buffer )
307 uint32_t buf_idx; /* buffer index */
308 uint16_t len; /* packet length */
309 uint16_t flags; /* buf changed, etc. */
310 uint64_t ptr; /* address for indirect buffers */
314 Describes a packet buffer, which normally is identified by
315 an index and resides in the mmapped region.
316 .It Dv packet buffers
317 Fixed size (normally 2 KB) packet buffers allocated by the kernel.
322 in the mmapped region is indicated by the
324 field in the structure returned by
326 From there, all other objects are reachable through
327 relative references (offsets or indexes).
328 Macros and functions in
329 .In net/netmap_user.h
help convert them into actual pointers:
332 .Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
333 .Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
334 .Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
336 .Dl char *buf = NETMAP_BUF(ring, buffer_index);
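.Pp
The index-based references described above can be modelled in a few
lines of plain C.
This is only a sketch under stated assumptions, not the code in the
netmap headers:
.Vt struct buf_pool ,
.Fn pool_buf
and
.Fn extra_bufs_count
are hypothetical names, the buffer area is assumed to be one contiguous
array of fixed-size buffers, and index 0 is assumed to terminate the
list of extra buffers.
.Bd -literal -compact
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the buffer area of a netmap memory region. */
struct buf_pool {
	char *base;		/* start of the buffer area */
	uint32_t buf_size;	/* size of each buffer, in bytes */
	uint32_t num_bufs;	/* number of buffers in the area */
};

/* Convert a buffer index into a pointer, as NETMAP_BUF does for a ring. */
static char *
pool_buf(const struct buf_pool *p, uint32_t idx)
{
	assert(idx < p->num_bufs);
	return p->base + (size_t)idx * p->buf_size;
}

/*
 * Count the buffers in a list of extra buffers starting at index head.
 * The first uint32_t of each buffer holds the index of the next one;
 * index 0 is assumed to end the list.
 */
static uint32_t
extra_bufs_count(const struct buf_pool *p, uint32_t head)
{
	uint32_t n = 0, next;

	while (head != 0) {
		memcpy(&next, pool_buf(p, head), sizeof(next));
		head = next;
		n++;
	}
	return n;
}
```
.Ed
In a real program the list head would come from
.Va ni_bufs_head
and the pointers from
.Fn NETMAP_BUF .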
337 .Sh RINGS, BUFFERS AND DATA I/O
339 are circular queues of packets with three indexes/pointers
340 .Va ( head , cur , tail ) ;
341 one slot is always kept empty.
344 should not be assumed to be a power of two.
347 is the first slot available to userspace;
351 select/poll will unblock when
357 is the first slot reserved to the kernel.
362 for convenience, the function
363 .Dl nm_ring_next(ring, index)
364 returns the next index modulo the ring size.
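.Pp
As a rough illustration of this index arithmetic, the following sketch
models a ring with plain integers.
.Vt struct ring_idx ,
.Fn ring_next
and
.Fn ring_count
are hypothetical names used only for this example; real programs
should rely on the definitions in the netmap headers.
.Bd -literal -compact
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the index fields of struct netmap_ring. */
struct ring_idx {
	uint32_t num_slots;	/* slots in the ring */
	uint32_t head;		/* first slot owned by the user */
	uint32_t cur;		/* wakeup position */
	uint32_t tail;		/* first slot owned by the kernel */
};

/* Next slot index, wrapping to 0 at the end of the ring. */
static uint32_t
ring_next(const struct ring_idx *r, uint32_t i)
{
	assert(i < r->num_slots);
	return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Count the slots in head ... tail-1 by walking the ring. */
static uint32_t
ring_count(const struct ring_idx *r)
{
	uint32_t i, n = 0;

	for (i = r->head; i != r->tail; i = ring_next(r, i))
		n++;
	return n;
}
```
.Ed
Because the wrap is a comparison rather than a bit mask,
the sketch also works for ring sizes that are not a power of two.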
369 are only modified by the user program;
371 is only modified by the kernel.
372 The kernel only reads/writes the
373 .Vt struct netmap_ring
375 during the execution of a netmap-related system call.
The only exceptions are slots (and buffers) in the range
377 .Va tail\ . . . head-1 ,
378 that are explicitly assigned to the kernel.
380 On transmit rings, after a
382 system call, slots in the range
383 .Va head\ . . . tail-1
384 are available for transmission.
385 User code should fill the slots sequentially
390 past slots ready to transmit.
392 may be moved further ahead if the user code needs
393 more slots before further transmissions (see
394 .Sx SCATTER GATHER I/O ) .
396 At the next NIOCTXSYNC/select()/poll(),
399 are pushed to the port, and
401 may advance if further slots have become available.
402 Below is an example of the evolution of a TX ring:
404 after the syscall, slots between cur and tail are (a)vailable
408 TX [.....aaaaaaaaaaa.............]
410 user creates new packets to (T)ransmit
414 TX [.....TTTTTaaaaaa.............]
416 NIOCTXSYNC/poll()/select() sends packets and reports new slots
420 TX [..........aaaaaaaaaaa........]
426 will block if there is no space in the ring, i.e.,
427 .Dl ring->cur == ring->tail
428 and return when new slots have become available.
430 High speed applications may want to amortize the cost of system calls
431 by preparing as many packets as possible before issuing them.
433 A transmit ring with pending transmissions has
434 .Dl ring->head != ring->tail + 1 (modulo the ring size).
436 .Va int nm_tx_pending(ring)
437 implements this test.
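.Pp
The two transmit-side conditions above (no space left, and
transmissions still pending) can be written out as follows.
This is a minimal sketch on a hypothetical
.Vt struct ring_idx
that carries only the index fields;
.Fn tx_space
and
.Fn tx_pending
are illustrative names, not functions provided by netmap.
.Bd -literal -compact
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the index fields of struct netmap_ring. */
struct ring_idx {
	uint32_t num_slots, head, cur, tail;
};

/* Slots the user may still fill: the circular range cur ... tail-1. */
static uint32_t
tx_space(const struct ring_idx *r)
{
	assert(r->cur < r->num_slots && r->tail < r->num_slots);
	return (r->tail >= r->cur) ? r->tail - r->cur :
	    r->tail + r->num_slots - r->cur;
}

/* Nonzero if head != tail + 1 (modulo the ring size). */
static int
tx_pending(const struct ring_idx *r)
{
	uint32_t after_tail;

	after_tail = (r->tail + 1 == r->num_slots) ? 0 : r->tail + 1;
	return r->head != after_tail;
}
```
.Ed
With 8 slots, an idle ring has head and cur at 0 and tail at 7:
seven slots are free and nothing is pending, since one slot is
always kept empty.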
439 On receive rings, after a
441 system call, the slots in the range
442 .Va head\& . . . tail-1
443 contain received packets.
444 User code should process them and advance
448 past slots it wants to return to the kernel.
450 may be moved further ahead if the user code wants to
451 wait for more packets
452 without returning all the previous slots to the kernel.
454 At the next NIOCRXSYNC/select()/poll(),
457 are returned to the kernel for further receives, and
459 may advance to report new incoming packets.
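.Pp
The difference between releasing slots and merely examining them can
be sketched with plain index arithmetic.
The names below
.Vt ( struct ring_idx ,
.Fn ring_distance ,
.Fn rx_advance )
are hypothetical and exist only for this example;
a real client would update the fields of
.Vt struct netmap_ring
directly.
.Bd -literal -compact
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the index fields of struct netmap_ring. */
struct ring_idx {
	uint32_t num_slots, head, cur, tail;
};

/* Number of slots in the circular range from ... to-1. */
static uint32_t
ring_distance(const struct ring_idx *r, uint32_t from, uint32_t to)
{
	return (to >= from) ? to - from : to + r->num_slots - from;
}

/*
 * Counting from the current head, return the first release slots to
 * the kernel and mark the first seen slots as examined.
 * Slots between the new head and the new cur stay held by the user.
 */
static void
rx_advance(struct ring_idx *r, uint32_t release, uint32_t seen)
{
	uint32_t avail = ring_distance(r, r->head, r->tail);

	assert(release <= seen && seen <= avail);
	r->cur = (r->head + seen) % r->num_slots;
	r->head = (r->head + release) % r->num_slots;
}
```
.Ed
For example, with head at 2, cur at 8 and tail at 16, releasing 5 slots
while having examined 8 leaves head at 7 and cur at 10, so three slots
remain held and six are still unread.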
461 Below is an example of the evolution of an RX ring:
463 after the syscall, there are some (h)eld and some (R)eceived slots
467 RX [..hhhhhhRRRRRRRR..........]
469 user advances head and cur, releasing some slots and holding others
473 RX [..*****hhhRRRRRR...........]
NIOCRXSYNC/poll()/select() recovers slots and reports new packets
479 RX [.......hhhRRRRRRRRRRRR....]
481 .Sh SLOTS AND PACKET BUFFERS
482 Normally, packets should be stored in the netmap-allocated buffers
483 assigned to slots when ports are bound to a file descriptor.
484 One packet is fully contained in a single buffer.
486 The following flags affect slot and buffer processing:
492 in the slot is changed.
493 This can be used to implement
494 zero-copy forwarding, see
495 .Sx ZERO-COPY FORWARDING .
497 reports when this buffer has been transmitted.
500 notifies transmit completions in batches, hence signals
501 can be delayed indefinitely.
502 This flag helps detect
503 when packets have been sent and a file descriptor can be closed.
505 When a ring is in 'transparent' mode,
506 packets marked with this flag by the user application are forwarded to the
507 other endpoint at the next system call, thus restoring (in a selective way)
508 the connection between a NIC and the host stack.
510 tells the forwarding code that the source MAC address for this
511 packet must not be used in the learning bridge code.
513 indicates that the packet's payload is in a user-supplied buffer
514 whose user virtual address is in the 'ptr' field of the slot.
515 The size can reach 65535 bytes.
517 This is only supported on the transmit ring of
ports, and it helps reduce data copies in the interconnection
522 indicates that the packet continues with subsequent buffers;
523 the last buffer in a packet must have the flag clear.
525 .Sh SCATTER GATHER I/O
526 Packets can span multiple slots if the
528 flag is set in all but the last slot.
529 The maximum length of a chain is 64 buffers.
530 This is normally used with
532 ports when connecting virtual machines, as they generate large
533 TSO segments that are not split unless they reach a physical device.
535 NOTE: The length field always refers to the individual
536 fragment; there is no place with the total length of a packet.
538 On receive rings the macro
540 indicates the remaining number of slots for this packet,
541 including the current one.
542 Slots with a value greater than 1 also have NS_MOREFRAG set.
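.Pp
Since no field holds the total length of a multi-slot packet,
a client has to add up the fragments itself.
The following sketch shows one way to do that on a simplified slot
array.
.Vt struct sk_slot ,
.Dv SK_MOREFRAG
and
.Fn packet_length
are hypothetical stand-ins for
.Vt struct netmap_slot ,
.Dv NS_MOREFRAG
and client code; the flag value is a placeholder.
.Bd -literal -compact
```c
#include <assert.h>
#include <stdint.h>

#define SK_MOREFRAG	0x0020	/* placeholder for NS_MOREFRAG */
#define SK_MAX_FRAGS	64	/* maximum chain length stated above */

/* Hypothetical stand-in for the len and flags of struct netmap_slot. */
struct sk_slot {
	uint16_t len;
	uint16_t flags;
};

/*
 * Sum the fragment lengths of the packet starting at slot first.
 * avail is the number of slots the caller owns from first onwards.
 * Returns the total length, or -1 if the chain does not end within
 * avail slots or within SK_MAX_FRAGS fragments.
 */
static long
packet_length(const struct sk_slot *slots, uint32_t num_slots,
    uint32_t first, uint32_t avail)
{
	long total = 0;
	uint32_t i = first, n;

	assert(first < num_slots);
	for (n = 0; n < avail && n < SK_MAX_FRAGS; n++) {
		total += slots[i].len;
		if (!(slots[i].flags & SK_MOREFRAG))
			return total;
		i = (i + 1 == num_slots) ? 0 : i + 1;
	}
	return -1;
}
```
.Ed
A return value of -1 tells the caller to wait for more slots rather
than process a truncated packet.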
545 uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
546 for non-blocking I/O.
547 They take no argument.
548 Two more ioctls (NIOCGINFO, NIOCREGIF) are used
549 to query and configure ports, with the following argument:
552 char nr_name[IFNAMSIZ]; /* (i) port name */
553 uint32_t nr_version; /* (i) API version */
554 uint32_t nr_offset; /* (o) nifp offset in mmap region */
555 uint32_t nr_memsize; /* (o) size of the mmap region */
556 uint32_t nr_tx_slots; /* (i/o) slots in tx rings */
557 uint32_t nr_rx_slots; /* (i/o) slots in rx rings */
558 uint16_t nr_tx_rings; /* (i/o) number of tx rings */
559 uint16_t nr_rx_rings; /* (i/o) number of rx rings */
560 uint16_t nr_ringid; /* (i/o) ring(s) we care about */
561 uint16_t nr_cmd; /* (i) special command */
562 uint16_t nr_arg1; /* (i/o) extra arguments */
563 uint16_t nr_arg2; /* (i/o) extra arguments */
564 uint32_t nr_arg3; /* (i/o) extra arguments */
uint32_t nr_flags; /* (i/o) open mode */
570 A file descriptor obtained through
also supports the ioctls supported by network devices, see
576 returns EINVAL if the named port does not support netmap.
577 Otherwise, it returns 0 and (advisory) information
579 Note that all the information below can change before the
580 interface is actually put in netmap mode.
583 indicates the size of the
588 mode all share the same memory region,
591 ports have independent regions for each port.
592 .It Pa nr_tx_slots , nr_rx_slots
593 indicate the size of transmit and receive rings.
594 .It Pa nr_tx_rings , nr_rx_rings
595 indicate the number of transmit
597 Both ring number and sizes may be configured at runtime
598 using interface-specific functions (e.g.,
603 binds the port named in
605 to the file descriptor.
606 For a physical device this also switches it into
609 it from the host stack.
610 Multiple file descriptors can be bound to the same port,
611 with proper synchronization left to the user.
613 The recommended way to bind a file descriptor to a port is
618 which parses names to access specific port types and
620 In the following we document the main features.
622 .Dv NIOCREGIF can also bind a file descriptor to one endpoint of a
624 consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
627 as a dispatcher towards slave processes.
629 To enable this function, the
631 field of the structure can be used as a hint to the kernel to
632 indicate how many pipes we expect to use, and reserve extra space
633 in the memory region.
635 On return, it gives the same info as NIOCGINFO,
640 indicating the identity of the rings controlled through the file
645 selects which rings are controlled through this file descriptor.
648 are indicated below, together with the naming schemes
649 that application libraries (such as the
651 indicated below) can use to indicate the specific set of rings.
652 In the example below, "netmap:foo" is any valid netmap port name.
653 .Bl -tag -width XXXXX
654 .It NR_REG_ALL_NIC "netmap:foo"
655 (default) all hardware ring pairs
656 .It NR_REG_SW "netmap:foo^"
657 the ``host rings'', connecting to the host stack.
658 .It NR_REG_NIC_SW "netmap:foo+"
659 all hardware rings and the host rings
660 .It NR_REG_ONE_NIC "netmap:foo-i"
661 only the i-th hardware ring pair, where the number is in
663 .It NR_REG_PIPE_MASTER "netmap:foo{i"
664 the master side of the netmap pipe whose identifier (i) is in
666 .It NR_REG_PIPE_SLAVE "netmap:foo}i"
667 the slave side of the netmap pipe whose identifier (i) is in
The identifier of a pipe must be thought of as part of the pipe name,
671 and does not need to be sequential.
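.Pp
The naming scheme listed above can be illustrated with a small
classifier.
This is a simplified sketch, not the parser used by the netmap
library:
.Vt enum ring_sel
and
.Fn classify_port
are hypothetical, and the sketch assumes that the interface name
itself contains none of the suffix characters.
.Bd -literal -compact
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical labels for the ring selections listed above. */
enum ring_sel {
	SEL_ALL_NIC, SEL_SW, SEL_NIC_SW, SEL_ONE_NIC,
	SEL_PIPE_MASTER, SEL_PIPE_SLAVE, SEL_BAD
};

/*
 * Classify a "netmap:..." port name by its suffix.
 * For ring and pipe selections the numeric identifier goes in *id.
 */
static enum ring_sel
classify_port(const char *name, long *id)
{
	const char *p;
	char *end;
	size_t n;

	*id = 0;
	if (strncmp(name, "netmap:", 7) != 0)
		return SEL_BAD;
	p = name + 7;
	n = strcspn(p, "^+-{}");	/* length of the interface name */
	if (n == 0)
		return SEL_BAD;
	p += n;
	if (*p == 0)
		return SEL_ALL_NIC;
	if (*p == '^' || *p == '+') {
		if (p[1] != 0)
			return SEL_BAD;
		return (*p == '^') ? SEL_SW : SEL_NIC_SW;
	}
	/* The remaining suffixes need a decimal identifier. */
	if (p[1] < '0' || p[1] > '9')
		return SEL_BAD;
	*id = strtol(p + 1, &end, 10);
	if (*end != 0)
		return SEL_BAD;
	if (*p == '-')
		return SEL_ONE_NIC;
	return (*p == '{') ? SEL_PIPE_MASTER : SEL_PIPE_SLAVE;
}
```
.Ed
Both ends of a pipe carry the same identifier, so "netmap:foo{3" and
"netmap:foo}3" name the two sides of one pipe.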
673 will only have a single ring pair with index 0,
674 irrespective of the value of
682 call pushes out any pending packets on the transmit ring, even if
683 no write events are specified.
684 The feature can be disabled by or-ing
685 .Va NETMAP_NO_TX_POLL
686 to the value written to
688 When this feature is used,
689 packets are transmitted only on
690 .Va ioctl(NIOCTXSYNC)
694 are called with a write event (POLLOUT/wfdset) or a full ring.
696 When registering a virtual interface that is dynamically created to a
698 switch, we can specify the desired number of rings (1 by default,
699 and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields.
701 tells the hardware of new packets to transmit, and updates the
702 number of slots available for transmission.
704 tells the hardware of consumed packets, and asks for newly available
707 .Sh SELECT, POLL, EPOLL, KQUEUE
713 file descriptor process rings as indicated in
717 respectively when write (POLLOUT) and read (POLLIN) events are requested.
718 Both block if no slots are available in the ring
719 .Va ( ring->cur == ring->tail ) .
720 Depending on the platform,
726 Packets in transmit rings are normally pushed out
727 (and buffers reclaimed) even without
728 requesting write events.
730 .Dv NETMAP_NO_TX_POLL
733 disables this feature.
734 By default, receive rings are processed only if read
735 events are requested.
737 .Dv NETMAP_DO_RX_POLL
739 .Em NIOCREGIF updates receive rings even without read events.
744 .Dv NETMAP_NO_TX_POLL
746 .Dv NETMAP_DO_RX_POLL
747 only have an effect when some event is posted for the file descriptor.
751 API is supposed to be used directly, both because of its simplicity and
752 for efficient integration with applications.
755 .In net/netmap_user.h
756 header provides a few macros and functions to ease creating
757 a file descriptor and doing I/O with a
760 These are loosely modeled after the
762 API, to ease porting of libpcap-based applications to
764 To use these extra functions, programs should
765 .Dl #define NETMAP_WITH_LIBS
767 .Dl #include <net/netmap_user.h>
769 The following functions are available:
770 .Bl -tag -width XXXXX
771 .It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg )
773 .Xr pcap_open_live 3 ,
774 binds a file descriptor to a port.
777 is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
781 provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
783 ifname and flags, and other fields can be overridden through
784 the other two arguments.
786 points to a struct nm_desc containing arguments (e.g., from a previously
787 open file descriptor) that should override the defaults.
The fields are used as described below.
790 can be set to a combination of the following flags:
791 .Va NETMAP_NO_TX_POLL ,
792 .Va NETMAP_DO_RX_POLL
793 (copied into nr_ringid);
795 (if arg points to the same memory region,
796 avoids the mmap and uses the values from it);
798 (ignores ifname and uses the values in arg);
802 (uses the fields from arg);
804 (uses the ring number and sizes from arg).
806 .It Va int nm_close(struct nm_desc *d )
807 closes the file descriptor, unmaps memory, frees resources.
808 .It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size )
811 pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
813 .It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg )
815 .Va pcap_dispatch() ,
816 applies a callback to incoming packets
817 .It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr )
820 fetches the next packet
822 .Sh SUPPORTED DEVICES
824 natively supports the following devices:
831 (providing igb, em and lem),
837 On Linux e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
839 NICs without native support can still be used in
841 mode through emulation.
842 Performance is inferior to native netmap
843 mode but still significantly higher than various raw socket types
844 (bpf, PF_PACKET, etc.).
845 Note that for slow devices (such as 1 Gbit/s and slower NICs,
846 or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
emulated and native modes will likely have similar or identical throughput.
849 When emulation is in use, packet sniffer programs such as tcpdump
850 could see received packets before they are diverted by netmap.
851 This behaviour is not intentional, being just an artifact of the implementation
853 Note that in case the netmap application subsequently moves packets received
854 from the emulated adapter onto the host RX ring, the sniffer will intercept
855 those packets again, since the packets are injected to the host stack as they
856 were received by the network interface.
858 Emulation is also available for devices with native netmap support,
859 which can be used for testing or performance comparison.
861 .Va dev.netmap.admode
862 globally controls how netmap mode is implemented.
863 .Sh SYSCTL VARIABLES AND MODULE PARAMETERS
Some aspects of the operation of
866 are controlled through sysctl variables on
869 and module parameters on Linux
870 .Em ( /sys/module/netmap/parameters/* ) :
871 .Bl -tag -width indent
872 .It Va dev.netmap.admode: 0
873 Controls the use of native or emulated adapter mode.
875 0 uses the best available option;
877 1 forces native mode and fails if not available;
2 forces emulated mode, hence never fails.
880 .It Va dev.netmap.generic_rings: 1
881 Number of rings used for emulated netmap mode
882 .It Va dev.netmap.generic_ringsize: 1024
883 Ring size used for emulated netmap mode
884 .It Va dev.netmap.generic_mit: 100000
885 Controls interrupt moderation for emulated mode
886 .It Va dev.netmap.mmap_unreg: 0
887 .It Va dev.netmap.fwd: 0
888 Forces NS_FORWARD mode
889 .It Va dev.netmap.flags: 0
890 .It Va dev.netmap.txsync_retry: 2
891 .It Va dev.netmap.no_pendintr: 1
892 Forces recovery of transmit buffers on system calls
893 .It Va dev.netmap.mitigate: 1
894 Propagates interrupt mitigation to user processes
895 .It Va dev.netmap.no_timestamp: 0
896 Disables the update of the timestamp in the netmap ring
897 .It Va dev.netmap.verbose: 0
898 Verbose kernel messages
899 .It Va dev.netmap.buf_num: 163840
900 .It Va dev.netmap.buf_size: 2048
901 .It Va dev.netmap.ring_num: 200
902 .It Va dev.netmap.ring_size: 36864
903 .It Va dev.netmap.if_num: 100
904 .It Va dev.netmap.if_size: 1024
905 Sizes and number of objects (netmap_if, netmap_ring, buffers)
906 for the global memory region.
907 The only parameter worth modifying is
908 .Va dev.netmap.buf_num
909 as it impacts the total amount of memory used by netmap.
910 .It Va dev.netmap.buf_curr_num: 0
911 .It Va dev.netmap.buf_curr_size: 0
912 .It Va dev.netmap.ring_curr_num: 0
913 .It Va dev.netmap.ring_curr_size: 0
914 .It Va dev.netmap.if_curr_num: 0
915 .It Va dev.netmap.if_curr_size: 0
916 Actual values in use.
917 .It Va dev.netmap.bridge_batch: 1024
918 Batch size used when moving packets across a
921 Values above 64 generally guarantee good
923 .It Va dev.netmap.ptnet_vnet_hdr: 1
924 Allow ptnet devices to use virtio-net headers
934 to wake up processes when significant events occur, and
938 is used to configure ports and
941 Applications may need to create threads and bind them to
942 specific cores to improve performance, using standard
946 .Xr pthread_setaffinity_np 3
951 comes with a few programs that can be used for testing or
958 .Pa tools/tools/netmap/
964 is a general purpose traffic source/sink.
967 .Dl pkt-gen -i ix0 -f tx -l 60
968 can generate an infinite stream of minimum size packets, and
969 .Dl pkt-gen -i ix0 -f rx
971 Both print traffic statistics, to help monitor
972 how the system performs.
has many options that can be used to set packet sizes, addresses,
976 rates, and use multiple send/receive threads and cores.
979 is another test program which interconnects two
982 It can be used for transparent forwarding between
984 .Dl bridge -i netmap:ix0 -i netmap:ix1
985 or even connect the NIC to the host stack using netmap
986 .Dl bridge -i netmap:ix0
987 .Ss USING THE NATIVE API
988 The following code implements a traffic generator
990 .Bd -literal -compact
991 #include <net/netmap_user.h>
995 struct netmap_if *nifp;
996 struct netmap_ring *ring;
1000 fd = open("/dev/netmap", O_RDWR);
1001 bzero(&nmr, sizeof(nmr));
1002 strcpy(nmr.nr_name, "ix0");
nmr.nr_version = NETMAP_API;
1004 ioctl(fd, NIOCREGIF, &nmr);
p = mmap(NULL, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
1006 nifp = NETMAP_IF(p, nmr.nr_offset);
1007 ring = NETMAP_TXRING(nifp, 0);
1009 fds.events = POLLOUT;
1012 while (!nm_ring_empty(ring)) {
buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
1015 ... prepare packet in buf ...
1016 ring->slot[i].len = ... packet length ...
1017 ring->head = ring->cur = nm_ring_next(ring, i);
1022 .Ss HELPER FUNCTIONS
1023 A simple receiver can be implemented using the helper functions
1024 .Bd -literal -compact
1025 #define NETMAP_WITH_LIBS
1026 #include <net/netmap_user.h>
1035 d = nm_open("netmap:ix0", NULL, 0, 0);
1036 fds.fd = NETMAP_FD(d);
1037 fds.events = POLLIN;
1040 while ( (buf = nm_nextpkt(d, &h)) )
consume_pkt(buf, h.len);
1046 .Ss ZERO-COPY FORWARDING
1047 Since physical interfaces share the same memory region,
1048 it is possible to do packet forwarding between ports
1050 The buffer from the transmit ring is used
1051 to replenish the receive ring:
1052 .Bd -literal -compact
1054 struct netmap_slot *src, *dst;
src = &rxr->slot[rxr->cur];
dst = &txr->slot[txr->cur];
1059 dst->buf_idx = src->buf_idx;
1060 dst->len = src->len;
1061 dst->flags = NS_BUF_CHANGED;
1063 src->flags = NS_BUF_CHANGED;
1064 rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
1065 txr->head = txr->cur = nm_ring_next(txr, txr->cur);
1068 .Ss ACCESSING THE HOST STACK
1069 The host stack is for all practical purposes just a regular ring pair,
1070 which you can access with the netmap API (e.g., with
1071 .Dl nm_open("netmap:eth0^", ... ) ;
1072 All packets that the host would send to an interface in
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
1077 A simple way to test the performance of a
1079 switch is to attach a sender and a receiver to it,
1080 e.g., running the following in two different terminals:
1081 .Dl pkt-gen -i vale1:a -f rx # receiver
1082 .Dl pkt-gen -i vale1:b -f tx # sender
1083 The same example can be used to test netmap pipes, by simply
1084 changing port names, e.g.,
1085 .Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
1086 .Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
1088 The following command attaches an interface and the host stack
1090 .Dl valectl -h vale2:em0
1093 clients attached to the same switch can now communicate
1094 with the network card or the host.
1103 .Pa http://info.iet.unipi.it/~luigi/netmap/
1105 Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
1106 Communications of the ACM, 55 (3), pp.45-51, March 2012
1108 Luigi Rizzo, netmap: a novel framework for fast packet I/O,
1109 Usenix ATC'12, June 2012, Boston
1111 Luigi Rizzo, Giuseppe Lettieri,
1112 VALE, a switched ethernet for virtual machines,
1113 ACM CoNEXT'12, December 2012, Nice
1115 Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
1116 Speeding up packet I/O in virtual machines,
1117 ACM/IEEE ANCS'13, October 2013, San Jose
framework was originally designed and implemented at the
1123 Universita` di Pisa in 2011 by
1125 and further extended with help from
1127 .An Gaetano Catalli ,
1128 .An Giuseppe Lettieri ,
1130 .An Vincenzo Maffione .
1135 have been funded by the European Commission within FP7 Projects
1136 CHANGE (257422) and OPENLAB (287581).
1138 No matter how fast the CPU and OS are,
1139 achieving line rate on 10G and faster interfaces
1140 requires hardware with sufficient performance.
1141 Several NICs are unable to sustain line rate with
1143 Insufficient PCIe or memory bandwidth
1144 can also cause reduced performance.
1146 Another frequent reason for low performance is the use
1147 of flow control on the link: a slow receiver can limit
1149 Be sure to disable flow control when running high
1151 .Ss SPECIAL NIC FEATURES
1153 is orthogonal to some NIC features such as
1154 multiqueue, schedulers, packet filters.
1156 Multiple transmit and receive rings are supported natively
1157 and can be configured with ordinary OS tools,
1161 device-specific sysctl variables.
1162 The same goes for Receive Packet Steering (RPS)
1163 and filtering of incoming traffic.
1168 .Em checksum offloading , TCP segmentation offloading ,
1169 .Em encryption , VLAN encapsulation/decapsulation ,
1171 When using netmap to exchange packets with the host stack,
1172 make sure to disable these features.