1 .\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
2 .\" All rights reserved.
4 .\" Redistribution and use in source and binary forms, with or without
5 .\" modification, are permitted provided that the following conditions
7 .\" 1. Redistributions of source code must retain the above copyright
8 .\" notice, this list of conditions and the following disclaimer.
9 .\" 2. Redistributions in binary form must reproduce the above copyright
10 .\" notice, this list of conditions and the following disclaimer in the
11 .\" documentation and/or other materials provided with the distribution.
13 .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
14 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
15 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
16 .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
17 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
18 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
19 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
20 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
21 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
22 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
25 .\" This document is derived in part from the enet man page (enet.4)
26 .\" distributed with 4.3BSD Unix.
35 .Nd a framework for fast packet I/O
40 is a framework for extremely fast and efficient packet I/O
41 for userspace and kernel clients, and for Virtual Machines.
44 Linux and some versions of Windows, and supports a variety of
48 .It Nm physical NIC ports
49 to access individual queues of network interfaces;
51 to inject packets into the host stack;
53 implementing a very fast and modular in-kernel software switch/dataplane;
55 a shared memory packet transport channel;
56 .It Nm netmap monitors
57 a mechanism similar to
64 are accessed interchangeably with the same API,
65 and are at least one order of magnitude faster than
66 standard OS mechanisms
67 (sockets, bpf, tun/tap interfaces, native switches, pipes).
68 With suitably fast hardware (NICs, PCIe buses, CPUs),
72 reaches 14.88 million packets per second (Mpps)
73 with much less than one core on 10 Gbit/s NICs;
74 35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
75 about 20 Mpps per core for VALE ports;
80 support can still use the API in emulated mode,
81 which uses unmodified device drivers and is 3-5 times faster than
85 Userspace clients can dynamically switch NICs into
87 mode and send and receive raw packets through
88 memory mapped buffers.
91 switch instances and ports,
95 can be created dynamically,
96 providing high speed packet I/O between processes,
97 virtual machines, NICs and the host stack.
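.Pp
The 14.88 Mpps figure quoted above for 10 Gbit/s NICs is the Ethernet
line rate for minimum-size frames.
The sketch below reproduces that arithmetic; the helper name
line_rate_pps and the overhead constant are illustrative and not part
of the netmap API.
Each frame is assumed to occupy its own length (including the CRC)
plus 20 bytes of preamble, start delimiter and inter-frame gap on the wire.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Per-frame wire overhead that is not part of the frame itself:
 * 7-byte preamble + 1-byte start delimiter + 12-byte inter-frame gap.
 */
#define ETH_WIRE_OVERHEAD 20u

/*
 * Packets per second at line rate for frames of frame_bytes
 * (frame length including the 4-byte CRC).
 * Hypothetical helper, not part of netmap.
 */
static uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	uint64_t bits_per_frame;

	assert(frame_bytes > 0);
	bits_per_frame = ((uint64_t)frame_bytes + ETH_WIRE_OVERHEAD) * 8;
	return link_bps / bits_per_frame;	/* integer division, rounds down */
}
```

With 64-byte frames each packet takes 84 bytes (672 bits) on the wire,
so a 10 Gbit/s link carries 10^10 / 672, or about 14.88 million, packets per second.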
100 supports both non-blocking I/O through
102 synchronization and blocking I/O through a file descriptor
103 and standard OS mechanisms such as
113 are implemented by a single kernel module, which also emulates the
115 API over standard drivers.
116 For best performance,
118 requires native support in device drivers.
119 A list of such devices is at the end of this document.
121 In the rest of this (long) manual page we document
122 various aspects of the
126 architecture, features and usage.
129 supports raw packet I/O through a
131 which can be connected to a physical interface
137 Ports use preallocated circular queues of buffers
139 residing in an mmapped region.
140 There is one ring for each transmit/receive queue of a
142 An additional ring pair connects to the host stack.
144 After binding a file descriptor to a port, a
146 client can send or receive packets in batches through
147 the rings, and possibly implement zero-copy forwarding
150 All NICs operating in
152 mode use the same memory region,
153 accessible to all processes that own
155 file descriptors bound to NICs.
161 by default use separate memory regions,
162 but can be independently configured to share memory.
163 .Sh ENTERING AND EXITING NETMAP MODE
164 The following section describes the system calls to create
172 Simpler, higher level functions are described in the
176 Ports and rings are created and controlled through a file descriptor,
177 created by opening a special device
178 .Dl fd = open("/dev/netmap", O_RDWR);
179 and then bound to a specific port with an
180 .Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
183 has multiple modes of operation controlled by the
187 specifies the netmap port name, as follows:
189 .It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
190 the data path of the NIC is disconnected from the host stack,
191 and the file descriptor is bound to the NIC (one or all queues),
192 or to the host stack;
194 the file descriptor is bound to port PPP of VALE switch SSS.
195 Switch instances and ports are dynamically created if necessary.
197 Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
198 cannot exceed IFNAMSIZ characters, and PPP cannot
199 be the name of any existing OS network interface.
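.Pp
The naming rule for VALE ports can be checked before attempting to bind.
The following is a minimal sketch, not the kernel's parser: the function
names are hypothetical, the limit of 16 is an assumed typical value of
IFNAMSIZ, and the name is assumed to need room for its terminating NUL.
The check against existing OS interface names is not modelled.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define NAME_MAX_LEN 16	/* assumption: typical IFNAMSIZ value */

/* Length of the leading run of [0-9a-zA-Z_] characters in s. */
static size_t
ident_len(const char *s)
{
	size_t n = 0;

	while (isalnum((unsigned char)s[n]) || s[n] == '_')
		n++;
	return n;
}

/* Return 1 if name has the form "valeSSS:PPP", 0 otherwise. */
static int
vale_name_ok(const char *name)
{
	size_t n;

	assert(name != NULL);
	if (strlen(name) >= NAME_MAX_LEN)	/* must fit with the NUL */
		return 0;
	if (strncmp(name, "vale", 4) != 0)
		return 0;
	name += 4;
	n = ident_len(name);			/* SSS */
	if (n == 0 || name[n] != ':')
		return 0;
	name += n + 1;
	n = ident_len(name);			/* PPP */
	return n > 0 && name[n] == '\0';
}
```

This sketch only covers plain port names; names carrying a pipe
suffix would need the suffix stripped first.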
204 indicates the size of the shared memory region,
205 and the number, size and location of all the
207 data structures, which can be accessed by mmapping the memory
208 .Dl char *mem = mmap(0, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
210 Non-blocking I/O is done with special
215 on the file descriptor permit blocking I/O.
219 mode, the OS will still believe the interface is up and running.
220 OS-generated packets for that NIC end up in a
222 ring, and another ring is used to send packets into the OS network stack.
225 on the file descriptor removes the binding,
226 and returns the NIC to normal mode (reconnecting the data path
227 to the host stack), or destroys the virtual port.
229 The data structures in the mmapped memory region are detailed in
230 .In sys/net/netmap.h ,
231 which is the ultimate reference for the
234 The main structures and fields are indicated below:
236 .It Dv struct netmap_if (one per interface )
240 const uint32_t ni_flags; /* properties */
242 const uint32_t ni_tx_rings; /* NIC tx rings */
243 const uint32_t ni_rx_rings; /* NIC rx rings */
244 uint32_t ni_bufs_head; /* head of extra bufs list */
249 Indicates the number of available rings
250 .Pa ( struct netmap_rings )
251 and their position in the mmapped region.
252 The number of tx and rx rings
253 .Pa ( ni_tx_rings , ni_rx_rings )
254 normally depends on the hardware.
255 NICs also have an extra tx/rx ring pair connected to the host stack.
257 can also request additional unbound buffers in the same memory space,
258 to be used as temporary storage for packets.
260 buffers is specified in the
263 On success, the kernel writes back to
265 the number of extra buffers actually allocated (possibly fewer
266 than requested if the memory space ran out of buffers).
268 contains the index of the first of these extra buffers,
269 which are connected in a list (the first uint32_t of each
270 buffer being the index of the next buffer in the list).
273 indicates the end of the list.
274 The application is free to modify
275 this list and use the buffers (i.e., binding them to the slots of a
277 When closing the netmap file descriptor,
278 the kernel frees the buffers contained in the list pointed to by
280 , irrespective of the buffers originally provided by the kernel on
282 .It Dv struct netmap_ring (one per ring )
286 const uint32_t num_slots; /* slots in each ring */
287 const uint32_t nr_buf_size; /* size of each buffer */
289 uint32_t head; /* (u) first buf owned by user */
290 uint32_t cur; /* (u) wakeup position */
291 const uint32_t tail; /* (k) first buf owned by kernel */
294 struct timeval ts; /* (k) time of last rxsync() */
296 struct netmap_slot slot[0]; /* array of slots */
300 Implements transmit and receive rings, with read/write
301 pointers, metadata and an array of
303 describing the buffers.
304 .It Dv struct netmap_slot (one per buffer )
307 uint32_t buf_idx; /* buffer index */
308 uint16_t len; /* packet length */
309 uint16_t flags; /* buf changed, etc. */
310 uint64_t ptr; /* address for indirect buffers */
314 Describes a packet buffer, which normally is identified by
315 an index and resides in the mmapped region.
316 .It Dv packet buffers
317 Fixed size (normally 2 KB) packet buffers allocated by the kernel.
322 in the mmapped region is indicated by the
324 field in the structure returned by
326 From there, all other objects are reachable through
327 relative references (offsets or indexes).
328 Macros and functions in
329 .In net/netmap_user.h
330 help convert them into actual pointers:
332 .Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset);
333 .Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
334 .Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
336 .Dl char *buf = NETMAP_BUF(ring, buffer_index);
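.Pp
The list of extra buffers described under struct netmap_if can be
walked with a few lines of code.
The sketch below models it on a flat buffer pool: pool_buf() is a
hypothetical stand-in for NETMAP_BUF(), the buffer size is the usual
2 KB, and index 0 is assumed to terminate the list.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 2048u	/* the "normally 2 KB" buffer size */
#define LIST_END 0u	/* assumption: index 0 terminates the list */

/* Stand-in for NETMAP_BUF(): address of buffer idx in a flat pool. */
static char *
pool_buf(char *pool, uint32_t idx)
{
	return pool + (size_t)idx * BUF_SIZE;
}

/* Pop the first extra buffer off the list; returns LIST_END if empty. */
static uint32_t
extra_buf_pop(char *pool, uint32_t *head)
{
	uint32_t idx = *head;

	/* The first uint32_t of a buffer holds the index of the next one. */
	if (idx != LIST_END)
		memcpy(head, pool_buf(pool, idx), sizeof(*head));
	return idx;
}

/* Push buffer idx back onto the list headed by *head. */
static void
extra_buf_push(char *pool, uint32_t *head, uint32_t idx)
{
	assert(idx != LIST_END);
	memcpy(pool_buf(pool, idx), head, sizeof(*head));
	*head = idx;
}
```

In a real program head would be the ni_bufs_head field; a buffer pushed
back onto the list is one the kernel can free when the file descriptor
is closed.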
337 .Sh RINGS, BUFFERS AND DATA I/O
339 are circular queues of packets with three indexes/pointers
340 .Va ( head , cur , tail ) ;
341 one slot is always kept empty.
344 should not be assumed to be a power of two.
347 is the first slot available to userspace;
351 select/poll will unblock when
357 is the first slot reserved to the kernel.
362 for convenience, the function
363 .Dl nm_ring_next(ring, index)
364 returns the next index modulo the ring size.
369 are only modified by the user program;
371 is only modified by the kernel.
372 The kernel only reads/writes the
373 .Vt struct netmap_ring
375 during the execution of a netmap-related system call.
376 The only exceptions are slots (and buffers) in the range
377 .Va tail\ . . . head-1 ,
378 that are explicitly assigned to the kernel.
380 On transmit rings, after a
382 system call, slots in the range
383 .Va head\ . . . tail-1
384 are available for transmission.
385 User code should fill the slots sequentially
390 past slots ready to transmit.
392 may be moved further ahead if the user code needs
393 more slots before further transmissions (see
394 .Sx SCATTER GATHER I/O ) .
396 At the next NIOCTXSYNC/select()/poll(),
399 are pushed to the port, and
401 may advance if further slots have become available.
402 Below is an example of the evolution of a TX ring:
404 after the syscall, slots between cur and tail are (a)vailable
408 TX [.....aaaaaaaaaaa.............]
410 user creates new packets to (T)ransmit
414 TX [.....TTTTTaaaaaa.............]
416 NIOCTXSYNC/poll()/select() sends packets and reports new slots
420 TX [..........aaaaaaaaaaa........]
426 will block if there is no space in the ring, i.e.,
427 .Dl ring->cur == ring->tail
428 and return when new slots have become available.
430 High speed applications may want to amortize the cost of system calls
431 by preparing as many packets as possible before issuing them.
433 A transmit ring with pending transmissions has
434 .Dl ring->head != ring->tail + 1 (modulo the ring size).
436 .Va int nm_tx_pending(ring)
437 implements this test.
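.Pp
The index arithmetic described in this section can be sketched as
follows.
struct ring_idx is a simplified stand-in holding only the index fields
of struct netmap_ring, and the three functions are illustrative
counterparts of the helpers in the netmap headers, not copies of them.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the index fields of struct netmap_ring. */
struct ring_idx {
	uint32_t num_slots;
	uint32_t head;
	uint32_t cur;
	uint32_t tail;
};

/* Next index modulo the ring size (num_slots need not be a power of two). */
static uint32_t
ring_next(const struct ring_idx *r, uint32_t i)
{
	assert(i < r->num_slots);
	return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots in head ... tail-1, i.e. those available to userspace. */
static uint32_t
ring_space(const struct ring_idx *r)
{
	return (r->tail >= r->head) ?
	    r->tail - r->head :
	    r->tail + r->num_slots - r->head;
}

/* Non-zero while the transmit ring still has pending transmissions. */
static int
tx_pending(const struct ring_idx *r)
{
	return ring_next(r, r->tail) != r->head;
}
```

Because one slot is always kept empty, a transmit ring of 8 slots
reports at most 7 available slots, and that is also the state in which
tx_pending() returns zero.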
439 On receive rings, after a
441 system call, the slots in the range
442 .Va head\& . . . tail-1
443 contain received packets.
444 User code should process them and advance
448 past slots it wants to return to the kernel.
450 may be moved further ahead if the user code wants to
451 wait for more packets
452 without returning all the previous slots to the kernel.
454 At the next NIOCRXSYNC/select()/poll(),
457 are returned to the kernel for further receives, and
459 may advance to report new incoming packets.
461 Below is an example of the evolution of an RX ring:
463 after the syscall, there are some (h)eld and some (R)eceived slots
467 RX [..hhhhhhRRRRRRRR..........]
469 user advances head and cur, releasing some slots and holding others
473 RX [..*****hhhRRRRRR...........]
475 NIOCRXSYNC/poll()/select() recovers slots and reports new packets
479 RX [.......hhhRRRRRRRRRRRR....]
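.Pp
The receive-side steps above can be sketched as a loop over the slots
in head ... tail-1.
The structures below are simplified stand-ins for struct netmap_ring
and struct netmap_slot, and rx_consume() is a hypothetical helper; a
real program would read the packet through NETMAP_BUF() instead of
only summing lengths.

```c
#include <assert.h>
#include <stdint.h>

#define RX_SLOTS 8	/* illustrative ring size */

/* Simplified stand-ins for struct netmap_slot and struct netmap_ring. */
struct rx_slot {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

struct rx_ring {
	uint32_t num_slots;
	uint32_t head;
	uint32_t cur;
	uint32_t tail;
	struct rx_slot slot[RX_SLOTS];
};

/* Consume up to limit received slots and return the bytes they held. */
static uint64_t
rx_consume(struct rx_ring *r, uint32_t limit)
{
	uint64_t bytes = 0;
	uint32_t i = r->head;

	assert(r->num_slots > 0 && r->num_slots <= RX_SLOTS);
	for (; i != r->tail && limit > 0; limit--) {
		bytes += r->slot[i].len;
		i = (i + 1 == r->num_slots) ? 0 : i + 1;
	}
	/* Slots before i are handed back at the next sync or poll. */
	r->head = r->cur = i;
	return bytes;
}
```

Calling rx_consume() with a small limit leaves the remaining received
slots in place, so they are seen again on the next call.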
481 .Sh SLOTS AND PACKET BUFFERS
482 Normally, packets should be stored in the netmap-allocated buffers
483 assigned to slots when ports are bound to a file descriptor.
484 One packet is fully contained in a single buffer.
486 The following flags affect slot and buffer processing:
492 in the slot is changed.
493 This can be used to implement
494 zero-copy forwarding, see
495 .Sx ZERO-COPY FORWARDING .
497 reports when this buffer has been transmitted.
500 notifies transmit completions in batches, hence signals
501 can be delayed indefinitely.
502 This flag helps detect
503 when packets have been sent and a file descriptor can be closed.
505 When a ring is in 'transparent' mode,
506 packets marked with this flag by the user application are forwarded to the
507 other endpoint at the next system call, thus restoring (in a selective way)
508 the connection between a NIC and the host stack.
510 tells the forwarding code that the source MAC address for this
511 packet must not be used in the learning bridge code.
513 indicates that the packet's payload is in a user-supplied buffer
514 whose user virtual address is in the 'ptr' field of the slot.
515 The size can reach 65535 bytes.
517 This is only supported on the transmit ring of
519 ports, and it helps reduce data copies in the interconnection
522 indicates that the packet continues with subsequent buffers;
523 the last buffer in a packet must have the flag clear.
525 .Sh SCATTER GATHER I/O
526 Packets can span multiple slots if the
528 flag is set in all but the last slot.
529 The maximum length of a chain is 64 buffers.
530 This is normally used with
532 ports when connecting virtual machines, as they generate large
533 TSO segments that are not split unless they reach a physical device.
535 NOTE: The length field always refers to the individual
536 fragment; there is no place with the total length of a packet.
538 On receive rings the macro
540 indicates the remaining number of slots for this packet,
541 including the current one.
542 Slots with a value greater than 1 also have NS_MOREFRAG set.
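.Pp
Since no field holds the total length of a multi-slot packet, a
receiver has to add up the fragment lengths itself.
The following sketch does this on a simplified slot array.
SG_MOREFRAG stands in for NS_MOREFRAG (the value 0x0020 is believed to
match net/netmap.h but should be treated as an assumption), and
sg_packet_len() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define SG_MOREFRAG  0x0020	/* stand-in for NS_MOREFRAG (assumed value) */
#define SG_MAX_FRAGS 64		/* maximum chain length stated above */

/* Simplified stand-in for struct netmap_slot. */
struct sg_slot {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

/*
 * Add up the fragment lengths of the packet starting at slot first.
 * Only slots before tail are examined.  On success return the total
 * length and store in *next the index following the last fragment;
 * return -1 if the chain is incomplete or longer than SG_MAX_FRAGS.
 */
static long
sg_packet_len(const struct sg_slot *slot, uint32_t num_slots,
    uint32_t first, uint32_t tail, uint32_t *next)
{
	long total = 0;
	uint32_t i = first;
	int frags;

	assert(first < num_slots && tail < num_slots);
	for (frags = 0; frags < SG_MAX_FRAGS && i != tail; frags++) {
		int more = slot[i].flags & SG_MOREFRAG;

		total += slot[i].len;
		i = (i + 1 == num_slots) ? 0 : i + 1;
		if (!more) {
			*next = i;
			return total;
		}
	}
	return -1;
}
```

A return value of -1 with slots still missing means the caller should
wait for the next sync before processing the packet.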
545 uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
546 for non-blocking I/O.
547 They take no argument.
548 Two more ioctls (NIOCGINFO, NIOCREGIF) are used
549 to query and configure ports, with the following argument:
552 char nr_name[IFNAMSIZ]; /* (i) port name */
553 uint32_t nr_version; /* (i) API version */
554 uint32_t nr_offset; /* (o) nifp offset in mmap region */
555 uint32_t nr_memsize; /* (o) size of the mmap region */
556 uint32_t nr_tx_slots; /* (i/o) slots in tx rings */
557 uint32_t nr_rx_slots; /* (i/o) slots in rx rings */
558 uint16_t nr_tx_rings; /* (i/o) number of tx rings */
559 uint16_t nr_rx_rings; /* (i/o) number of rx rings */
560 uint16_t nr_ringid; /* (i/o) ring(s) we care about */
561 uint16_t nr_cmd; /* (i) special command */
562 uint16_t nr_arg1; /* (i/o) extra arguments */
563 uint16_t nr_arg2; /* (i/o) extra arguments */
564 uint32_t nr_arg3; /* (i/o) extra arguments */
565 uint32_t nr_flags; /* (i/o) open mode */
570 A file descriptor obtained through
572 also supports the ioctl supported by network devices, see
576 returns EINVAL if the named port does not support netmap.
577 Otherwise, it returns 0 and (advisory) information
579 Note that all the information below can change before the
580 interface is actually put in netmap mode.
583 indicates the size of the
588 mode all share the same memory region,
591 ports have independent regions for each port.
592 .It Pa nr_tx_slots , nr_rx_slots
593 indicate the size of transmit and receive rings.
594 .It Pa nr_tx_rings , nr_rx_rings
595 indicate the number of transmit
597 Both ring number and sizes may be configured at runtime
598 using interface-specific functions (e.g.,
603 binds the port named in
605 to the file descriptor.
606 For a physical device this also switches it into
609 it from the host stack.
610 Multiple file descriptors can be bound to the same port,
611 with proper synchronization left to the user.
613 The recommended way to bind a file descriptor to a port is
618 which parses names to access specific port types and
620 In the following we document the main features.
622 .Dv NIOCREGIF can also bind a file descriptor to one endpoint of a
624 consisting of two netmap ports with a crossover connection.
625 A netmap pipe shares the same memory space as the parent port,
626 and is meant to enable configurations where a master process acts
627 as a dispatcher towards slave processes.
629 To enable this function, the
631 field of the structure can be used as a hint to the kernel to
632 indicate how many pipes we expect to use, and reserve extra space
633 in the memory region.
635 On return, it gives the same info as NIOCGINFO,
640 indicating the identity of the rings controlled through the file
645 selects which rings are controlled through this file descriptor.
648 are indicated below, together with the naming schemes
649 that application libraries (such as the
651 indicated below) can use to indicate the specific set of rings.
652 In the example below, "netmap:foo" is any valid netmap port name.
653 .Bl -tag -width XXXXX
654 .It NR_REG_ALL_NIC "netmap:foo"
655 (default) all hardware ring pairs
656 .It NR_REG_SW "netmap:foo^"
657 the ``host rings'', connecting to the host stack.
658 .It NR_REG_NIC_SW "netmap:foo*"
659 all hardware rings and the host rings
660 .It NR_REG_ONE_NIC "netmap:foo-i"
661 only the i-th hardware ring pair, where the number is in
663 .It NR_REG_PIPE_MASTER "netmap:foo{i"
664 the master side of the netmap pipe whose identifier (i) is in
666 .It NR_REG_PIPE_SLAVE "netmap:foo}i"
667 the slave side of the netmap pipe whose identifier (i) is in
670 The identifier of a pipe must be thought of as part of the pipe name,
671 and does not need to be sequential.
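.Pp
The naming scheme in the list above can be decoded with a small
parser.
The sketch below classifies only the suffix that follows the port
name; the enum values and the function name are hypothetical and do
not correspond to the NR_REG_* constants or to the library's own
parser, which accepts more than is shown here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Illustrative selectors, one per naming form in the list above. */
enum ring_sel {
	SEL_ALL_NIC,		/* "netmap:foo"   */
	SEL_SW,			/* "netmap:foo^"  */
	SEL_NIC_SW,		/* "netmap:foo*"  */
	SEL_ONE_NIC,		/* "netmap:foo-i" */
	SEL_PIPE_MASTER,	/* "netmap:foo{i" */
	SEL_PIPE_SLAVE,		/* "netmap:foo}i" */
	SEL_INVALID
};

/*
 * Classify the suffix following the port name and store the ring or
 * pipe identifier (0 when the form carries none) in *id.
 */
static enum ring_sel
parse_suffix(const char *suffix, unsigned *id)
{
	enum ring_sel sel;
	char *end;

	assert(suffix != NULL && id != NULL);
	*id = 0;
	switch (suffix[0]) {
	case '\0':
		return SEL_ALL_NIC;
	case '^':
		return suffix[1] == '\0' ? SEL_SW : SEL_INVALID;
	case '*':
		return suffix[1] == '\0' ? SEL_NIC_SW : SEL_INVALID;
	case '-':
		sel = SEL_ONE_NIC;
		break;
	case '{':
		sel = SEL_PIPE_MASTER;
		break;
	case '}':
		sel = SEL_PIPE_SLAVE;
		break;
	default:
		return SEL_INVALID;
	}
	/* The remaining forms require a decimal identifier. */
	if (!isdigit((unsigned char)suffix[1]))
		return SEL_INVALID;
	*id = (unsigned)strtoul(suffix + 1, &end, 10);
	return *end == '\0' ? sel : SEL_INVALID;
}
```

For example, the suffix of "netmap:foo{12" is "{12", which this sketch
classifies as the master side of pipe 12.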
673 will only have a single ring pair with index 0,
674 irrespective of the value of
682 call pushes out any pending packets on the transmit ring, even if
683 no write events are specified.
684 The feature can be disabled by or-ing
685 .Va NETMAP_NO_TX_POLL
686 to the value written to
688 When this feature is used,
689 packets are transmitted only on
690 .Va ioctl(NIOCTXSYNC)
694 are called with a write event (POLLOUT/wfdset) or a full ring.
696 When registering a virtual interface that is dynamically created to a
698 switch, we can specify the desired number of rings (1 by default,
699 and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields.
701 tells the hardware of new packets to transmit, and updates the
702 number of slots available for transmission.
704 tells the hardware of consumed packets, and asks for newly available
707 .Sh SELECT, POLL, EPOLL, KQUEUE
713 file descriptor process rings as indicated in
717 respectively when write (POLLOUT) and read (POLLIN) events are requested.
718 Both block if no slots are available in the ring
719 .Va ( ring->cur == ring->tail ) .
720 Depending on the platform,
726 Packets in transmit rings are normally pushed out
727 (and buffers reclaimed) even without
728 requesting write events.
730 .Dv NETMAP_NO_TX_POLL
733 disables this feature.
734 By default, receive rings are processed only if read
735 events are requested.
737 .Dv NETMAP_DO_RX_POLL
739 .Em NIOCREGIF updates receive rings even without read events.
744 .Dv NETMAP_NO_TX_POLL
746 .Dv NETMAP_DO_RX_POLL
747 only have an effect when some event is posted for the file descriptor.
751 API is supposed to be used directly, both because of its simplicity and
752 for efficient integration with applications.
755 .In net/netmap_user.h
756 header provides a few macros and functions to ease creating
757 a file descriptor and doing I/O with a
760 These are loosely modeled after the
762 API, to ease porting of libpcap-based applications to
764 To use these extra functions, programs should
765 .Dl #define NETMAP_WITH_LIBS
767 .Dl #include <net/netmap_user.h>
769 The following functions are available:
770 .Bl -tag -width XXXXX
771 .It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg )
773 .Xr pcap_open_live 3 ,
774 binds a file descriptor to a port.
777 is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
781 provides the initial values for the argument to the NIOCREGIF ioctl.
782 The nr_flags and nr_ringid values are overwritten by parsing
783 ifname and flags, and other fields can be overridden through
784 the other two arguments.
786 points to a struct nm_desc containing arguments (e.g., from a previously
787 open file descriptor) that should override the defaults.
788 The fields are used as described below.
790 can be set to a combination of the following flags:
791 .Va NETMAP_NO_TX_POLL ,
792 .Va NETMAP_DO_RX_POLL
793 (copied into nr_ringid);
795 (if arg points to the same memory region,
796 avoids the mmap and uses the values from it);
798 (ignores ifname and uses the values in arg);
802 (uses the fields from arg);
804 (uses the ring number and sizes from arg).
806 .It Va int nm_close(struct nm_desc *d )
807 closes the file descriptor, unmaps memory, frees resources.
808 .It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size )
811 pushes a packet to a ring, returns the size
812 of the packet if successful, or 0 on error;
813 .It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg )
815 .Va pcap_dispatch() ,
816 applies a callback to incoming packets
817 .It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr )
820 fetches the next packet
822 .Sh SUPPORTED DEVICES
824 natively supports the following devices:
831 .Pq providing Xr igb 4 and Xr em 4 ,
837 On Linux e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
839 NICs without native support can still be used in
841 mode through emulation.
842 Performance is inferior to native netmap
843 mode but still significantly higher than various raw socket types
844 (bpf, PF_PACKET, etc.).
845 Note that for slow devices (such as 1 Gbit/s and slower NICs,
846 or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
847 emulated and native mode will likely have similar or identical throughput.
849 When emulation is in use, packet sniffer programs such as tcpdump
850 could see received packets before they are diverted by netmap.
851 This behaviour is not intentional, being just an artifact of the implementation
853 Note that in case the netmap application subsequently moves packets received
854 from the emulated adapter onto the host RX ring, the sniffer will intercept
855 those packets again, since the packets are injected to the host stack as they
856 were received by the network interface.
858 Emulation is also available for devices with native netmap support,
859 which can be used for testing or performance comparison.
861 .Va dev.netmap.admode
862 globally controls how netmap mode is implemented.
863 .Sh SYSCTL VARIABLES AND MODULE PARAMETERS
864 Some aspects of the operation of
868 are controlled through sysctl variables on
871 and module parameters on Linux
872 .Em ( /sys/module/netmap/parameters/* ) :
873 .Bl -tag -width indent
874 .It Va dev.netmap.admode: 0
875 Controls the use of native or emulated adapter mode.
877 0 uses the best available option;
879 1 forces native mode and fails if not available;
881 2 forces emulated mode, hence never fails.
882 .It Va dev.netmap.generic_rings: 1
883 Number of rings used for emulated netmap mode
884 .It Va dev.netmap.generic_ringsize: 1024
885 Ring size used for emulated netmap mode
886 .It Va dev.netmap.generic_mit: 100000
887 Controls interrupt moderation for emulated mode
888 .It Va dev.netmap.fwd: 0
889 Forces NS_FORWARD mode
890 .It Va dev.netmap.txsync_retry: 2
891 Number of txsync loops in the
894 .It Va dev.netmap.no_pendintr: 1
895 Forces recovery of transmit buffers on system calls
896 .It Va dev.netmap.no_timestamp: 0
897 Disables the update of the timestamp in the netmap ring
898 .It Va dev.netmap.verbose: 0
899 Verbose kernel messages
900 .It Va dev.netmap.buf_num: 163840
901 .It Va dev.netmap.buf_size: 2048
902 .It Va dev.netmap.ring_num: 200
903 .It Va dev.netmap.ring_size: 36864
904 .It Va dev.netmap.if_num: 100
905 .It Va dev.netmap.if_size: 1024
906 Sizes and number of objects (netmap_if, netmap_ring, buffers)
907 for the global memory region.
908 The only parameter worth modifying is
909 .Va dev.netmap.buf_num
910 as it impacts the total amount of memory used by netmap.
911 .It Va dev.netmap.buf_curr_num: 0
912 .It Va dev.netmap.buf_curr_size: 0
913 .It Va dev.netmap.ring_curr_num: 0
914 .It Va dev.netmap.ring_curr_size: 0
915 .It Va dev.netmap.if_curr_num: 0
916 .It Va dev.netmap.if_curr_size: 0
917 Actual values in use.
918 .It Va dev.netmap.priv_buf_num: 4098
919 .It Va dev.netmap.priv_buf_size: 2048
920 .It Va dev.netmap.priv_ring_num: 4
921 .It Va dev.netmap.priv_ring_size: 20480
922 .It Va dev.netmap.priv_if_num: 2
923 .It Va dev.netmap.priv_if_size: 1024
924 Sizes and number of objects (netmap_if, netmap_ring, buffers)
925 for private memory regions.
926 A separate memory region is used for each
928 port and each pair of
930 .It Va dev.netmap.bridge_batch: 1024
931 Batch size used when moving packets across a
934 Values above 64 generally guarantee good
936 .It Va dev.netmap.ptnet_vnet_hdr: 1
937 Allow ptnet devices to use virtio-net headers
947 to wake up processes when significant events occur, and
951 is used to configure ports and
954 Applications may need to create threads and bind them to
955 specific cores to improve performance, using standard
959 .Xr pthread_setaffinity_np 3
964 comes with a few programs that can be used for testing or
971 .Pa tools/tools/netmap/
977 is a general purpose traffic source/sink.
980 .Dl pkt-gen -i ix0 -f tx -l 60
981 can generate an infinite stream of minimum size packets, and
982 .Dl pkt-gen -i ix0 -f rx
984 Both print traffic statistics, to help monitor
985 how the system performs.
988 has many options that can be used to set packet sizes, addresses,
989 rates, and use multiple send/receive threads and cores.
992 is another test program which interconnects two
995 It can be used for transparent forwarding between
997 .Dl bridge -i netmap:ix0 -i netmap:ix1
998 or even connect the NIC to the host stack using netmap
999 .Dl bridge -i netmap:ix0
1000 .Ss USING THE NATIVE API
1001 The following code implements a traffic generator:
1003 .Bd -literal -compact
1004 #include <net/netmap_user.h>
1008 struct netmap_if *nifp;
1009 struct netmap_ring *ring;
1013 fd = open("/dev/netmap", O_RDWR);
1014 bzero(&nmr, sizeof(nmr));
1015 strcpy(nmr.nr_name, "ix0");
1016 nmr.nr_version = NETMAP_API;
1017 ioctl(fd, NIOCREGIF, &nmr);
1018 p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
1019 nifp = NETMAP_IF(p, nmr.nr_offset);
1020 ring = NETMAP_TXRING(nifp, 0);
1022 fds.events = POLLOUT;
1025 while (!nm_ring_empty(ring)) {
1027 buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
1028 ... prepare packet in buf ...
1029 ring->slot[i].len = ... packet length ...
1030 ring->head = ring->cur = nm_ring_next(ring, i);
1035 .Ss HELPER FUNCTIONS
1036 A simple receiver can be implemented using the helper functions:
1038 .Bd -literal -compact
1039 #define NETMAP_WITH_LIBS
1040 #include <net/netmap_user.h>
1049 d = nm_open("netmap:ix0", NULL, 0, NULL);
1050 fds.fd = NETMAP_FD(d);
1051 fds.events = POLLIN;
1054 while ( (buf = nm_nextpkt(d, &h)) )
1055 consume_pkt(buf, h.len);
1060 .Ss ZERO-COPY FORWARDING
1061 Since physical interfaces share the same memory region,
1062 it is possible to do packet forwarding between ports
1064 The buffer from the transmit ring is used
1065 to replenish the receive ring:
1067 .Bd -literal -compact
1069 struct netmap_slot *src, *dst;
1071 src = &rxr->slot[rxr->cur];
1072 dst = &txr->slot[txr->cur];
1074 dst->buf_idx = src->buf_idx;
1075 dst->len = src->len;
1076 dst->flags = NS_BUF_CHANGED;
1078 src->flags = NS_BUF_CHANGED;
1079 rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
1080 txr->head = txr->cur = nm_ring_next(txr, txr->cur);
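.Pp
The buffer exchange at the heart of the fragment above can be isolated
in a small function.
The sketch below uses a simplified stand-in for struct netmap_slot;
FWD_BUF_CHANGED mirrors NS_BUF_CHANGED, and fwd_swap() is a
hypothetical helper rather than part of the netmap API.

```c
#include <assert.h>
#include <stdint.h>

#define FWD_BUF_CHANGED 0x0001	/* stand-in for NS_BUF_CHANGED */

/* Simplified stand-in for struct netmap_slot. */
struct fwd_slot {
	uint32_t buf_idx;
	uint16_t len;
	uint16_t flags;
};

/*
 * Move the packet in the receive slot src to the transmit slot dst
 * without copying: the two slots exchange buffer indexes, so the
 * buffer that was attached to dst replenishes the receive ring.
 */
static void
fwd_swap(struct fwd_slot *src, struct fwd_slot *dst)
{
	uint32_t tmp;

	assert(src != dst);
	tmp = dst->buf_idx;
	dst->buf_idx = src->buf_idx;
	dst->len = src->len;
	dst->flags = FWD_BUF_CHANGED;	/* tell the kernel the buffer changed */
	src->buf_idx = tmp;
	src->flags = FWD_BUF_CHANGED;
}
```

After the swap the caller still has to advance head and cur on both
rings, as the fragment above does.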
1083 .Ss ACCESSING THE HOST STACK
1084 The host stack is for all practical purposes just a regular ring pair,
1085 which you can access with the netmap API (e.g., with
1086 .Dl nm_open("netmap:eth0^", ... ) ;
1087 All packets that the host would send to an interface in
1089 mode end up in the RX ring, whereas all packets queued to the
1090 TX ring are sent up to the host stack.
1092 A simple way to test the performance of a
1094 switch is to attach a sender and a receiver to it,
1095 e.g., running the following in two different terminals:
1096 .Dl pkt-gen -i vale1:a -f rx # receiver
1097 .Dl pkt-gen -i vale1:b -f tx # sender
1098 The same example can be used to test netmap pipes, by simply
1099 changing port names, e.g.,
1100 .Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
1101 .Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
1103 The following command attaches an interface and the host stack
1105 .Dl valectl -h vale2:em0
1108 clients attached to the same switch can now communicate
1109 with the network card or the host.
1118 .Pa http://info.iet.unipi.it/~luigi/netmap/
1120 Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
1121 Communications of the ACM, 55 (3), pp.45-51, March 2012
1123 Luigi Rizzo, netmap: a novel framework for fast packet I/O,
1124 Usenix ATC'12, June 2012, Boston
1126 Luigi Rizzo, Giuseppe Lettieri,
1127 VALE, a switched ethernet for virtual machines,
1128 ACM CoNEXT'12, December 2012, Nice
1130 Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
1131 Speeding up packet I/O in virtual machines,
1132 ACM/IEEE ANCS'13, October 2013, San Jose
1137 framework was originally designed and implemented at the
1138 Universita` di Pisa in 2011 by
1140 and further extended with help from
1142 .An Gaetano Catalli ,
1143 .An Giuseppe Lettieri ,
1145 .An Vincenzo Maffione .
1150 have been funded by the European Commission within FP7 Projects
1151 CHANGE (257422) and OPENLAB (287581).
1153 No matter how fast the CPU and OS are,
1154 achieving line rate on 10G and faster interfaces
1155 requires hardware with sufficient performance.
1156 Several NICs are unable to sustain line rate with
1158 Insufficient PCIe or memory bandwidth
1159 can also cause reduced performance.
1161 Another frequent reason for low performance is the use
1162 of flow control on the link: a slow receiver can limit
1164 Be sure to disable flow control when running high
1166 .Ss SPECIAL NIC FEATURES
1168 is orthogonal to some NIC features such as
1169 multiqueue, schedulers, packet filters.
1171 Multiple transmit and receive rings are supported natively
1172 and can be configured with ordinary OS tools,
1176 device-specific sysctl variables.
1177 The same goes for Receive Packet Steering (RPS)
1178 and filtering of incoming traffic.
1183 .Em checksum offloading , TCP segmentation offloading ,
1184 .Em encryption , VLAN encapsulation/decapsulation ,
1186 When using netmap to exchange packets with the host stack,
1187 make sure to disable these features.