/*
 * Copyright (C) 2011-2014 Matteo Landi, Luigi Rizzo. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * This module supports memory mapped access to network devices,
 * see netmap(4).
 *
 * The module uses a large memory pool allocated by the kernel
 * and accessible as mmapped memory by multiple userspace threads/processes.
 * The memory pool contains packet buffers and "netmap rings",
 * i.e. user-accessible copies of the interface's queues.
 * Access to the network card works like this:
 * 1. a process/thread issues one or more open() on /dev/netmap, to create
 *    select()able file descriptors on which events are reported.
 * 2. on each descriptor, the process issues an ioctl() to identify
 *    the interface that should report events to the file descriptor.
 * 3. on each descriptor, the process issues an mmap() request to
 *    map the shared memory region within the process' address space.
 *    The list of interesting queues is indicated by a location in
 *    the shared memory region.
 * 4. using the functions in the netmap(4) userspace API, a process
 *    can look up the occupation state of a queue, access memory buffers,
 *    and retrieve received packets or enqueue packets to transmit.
 * 5. using some ioctl()s the process can synchronize the userspace view
 *    of the queue with the actual status in the kernel. This includes both
 *    receiving the notification of new packets, and transmitting new
 *    packets on the output interface.
 * 6. select() or poll() can be used to wait for events on individual
 *    transmit or receive queues (or all queues for a given interface).
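 *
 * A minimal userspace sketch of steps 1-6 (names from the public netmap
 * API in net/netmap_user.h; "em0" is just an example, error handling
 * omitted):
 *
 *	struct nmreq req = { .nr_version = NETMAP_API };
 *	int fd = open("/dev/netmap", O_RDWR);		// step 1
 *	strlcpy(req.nr_name, "em0", sizeof(req.nr_name));
 *	ioctl(fd, NIOCREGIF, &req);			// step 2
 *	void *mem = mmap(0, req.nr_memsize, PROT_READ | PROT_WRITE,
 *	    MAP_SHARED, fd, 0);				// step 3
 *	struct netmap_if *nifp = NETMAP_IF(mem, req.nr_offset);
 *	struct netmap_ring *txr = NETMAP_TXRING(nifp, 0);	// step 4
 *	ioctl(fd, NIOCTXSYNC, NULL);			// step 5
 *	struct pollfd pfd = { .fd = fd, .events = POLLIN };
 *	poll(&pfd, 1, -1);				// step 6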

		SYNCHRONIZATION (USER)

The netmap rings and data structures may be shared among multiple
user threads or even independent processes.
Any synchronization among those threads/processes is delegated
to the threads themselves. Only one thread at a time can be in
a system call on the same netmap ring. The OS does not enforce
this and only guarantees against system crashes in case of
invalid usage.
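
For example, a multi-threaded application is expected to serialize its
own syscalls per ring, along these lines (a sketch, not part of the
kernel API; the per-ring mutex belongs to the application):

	pthread_mutex_lock(&ring_lock[i]);	// one app lock per ring
	ioctl(fd, NIOCTXSYNC, NULL);		// syscall touching ring i
	pthread_mutex_unlock(&ring_lock[i]);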

		LOCKING (INTERNAL)

Within the kernel, access to the netmap rings is protected as follows:

- a spinlock on each ring, to handle producer/consumer races on
  RX rings attached to the host stack (against multiple host
  threads writing from the host stack to the same ring),
  and on 'destination' rings attached to a VALE switch
  (i.e. RX rings in VALE ports, and TX rings in NIC/host ports),
  protecting multiple active senders for the same destination;

- an atomic variable to guarantee that there is at most one
  instance of *_*xsync() on the ring at any time.
  For rings connected to user file
  descriptors, an atomic_test_and_set() protects this, and the
  lock on the ring is not actually used.
  For NIC RX rings connected to a VALE switch, an atomic_test_and_set()
  is also used to prevent multiple executions (the driver might indeed
  already guarantee this).
  For NIC TX rings connected to a VALE switch, the lock arbitrates
  access to the queue (both when allocating buffers and when pushing
  them out);

- *xsync() should be protected against initializations of the card.
  On FreeBSD most devices have the reset routine protected by
  a RING lock (ixgbe, igb, em) or core lock (re). lem is missing
  the RING protection on rx_reset(), this should be added.

  On linux there is an external lock on the tx path, which probably
  also arbitrates access to the reset routine. XXX to be revised

- a per-interface core_lock protecting access from the host stack
  while interfaces may be detached from netmap mode.
  XXX there should be no need for this lock if we detach the interfaces
  only while they are down.

		VALE SWITCH

NMG_LOCK() serializes all modifications to switches and ports.
A switch cannot be deleted until all ports are gone.

For each switch, an SX lock (RWlock on linux) protects
deletion of ports. When configuring or deleting a port, the
lock is acquired in exclusive mode (after holding NMG_LOCK).
When forwarding, the lock is acquired in shared mode (without NMG_LOCK).
The lock is held throughout the entire forwarding cycle,
during which the thread may incur a page fault.
Hence it is important that sleepable shared locks are used.

On the rx ring, the per-port lock is grabbed initially to reserve
a number of slots in the ring, then the lock is released,
packets are copied from source to destination, and then
the lock is acquired again and the receive ring is updated.
(A similar thing is done on the tx ring for NIC and host stack
ports attached to the switch)
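
In pseudocode, the forwarding cycle above looks like the following
sketch (hypothetical helper names, not the actual functions):

	lock_shared(port->lock);		// reserve slots
	first = reserve_slots(ring, n);
	unlock_shared(port->lock);
	copy_packets(src, ring, first, n);	// no lock held, may fault
	lock_shared(port->lock);
	publish_slots(ring, first, n);		// update the receive ring
	unlock_shared(port->lock);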
 */

/* --- internals ----
 *
 * Roadmap to the code that implements the above.
 *
 * > 1. a process/thread issues one or more open() on /dev/netmap, to create
 * >    select()able file descriptors on which events are reported.
 *
 * 	Internally, we allocate a netmap_priv_d structure, that will be
 * 	initialized on ioctl(NIOCREGIF).
 *
 * 	os-specific:
 * 	    FreeBSD: netmap_open (netmap_freebsd.c). The priv is
 * 		     per-thread.
 * 	    linux:   linux_netmap_open (netmap_linux.c). The priv is
 * 		     per-open.
 *
 * > 2. on each descriptor, the process issues an ioctl() to identify
 * >    the interface that should report events to the file descriptor.
 *
 * 	Implemented by netmap_ioctl(), NIOCREGIF case, with nmr->nr_cmd==0.
 * 	Most important things happen in netmap_get_na() and
 * 	netmap_do_regif(), called from there. Additional details can be
 * 	found in the comments above those functions.
 *
 * 	In all cases, this action creates/takes-a-reference-to a
 * 	netmap_*_adapter describing the port, and allocates a netmap_if
 * 	and all necessary netmap rings, filling them with netmap buffers.
 *
 * 	In this phase, the sync callbacks for each ring are set (these are used
 * 	in steps 5 and 6 below). The callbacks depend on the type of adapter.
 * 	The adapter creation/initialization code puts them in the
 * 	netmap_adapter (fields na->nm_txsync and na->nm_rxsync). Then, they
 * 	are copied from there to the netmap_kring's during netmap_do_regif(), by
 * 	the nm_krings_create() callback. All the nm_krings_create callbacks
 * 	actually call netmap_krings_create() to perform this and the other
 * 	common stuff. netmap_krings_create() also takes care of the host rings,
 * 	if needed, by setting their sync callbacks appropriately.
 *
 * 	Additional actions depend on the kind of netmap_adapter that has been
 * 	registered:
 *
 * 	- netmap_hw_adapter:		[netmap.c]
 * 	      This is a system netdev/ifp with native netmap support.
 * 	      The ifp is detached from the host stack by redirecting:
 * 	        - transmissions (from the network stack) to netmap_transmit()
 * 	        - receive notifications to the nm_notify() callback for
 * 	          this adapter. The callback is normally netmap_notify(), unless
 * 	          the ifp is attached to a bridge using bwrap, in which case it
 * 	          is netmap_bwrap_intr_notify().
 *
 * 	- netmap_generic_adapter:	[netmap_generic.c]
 * 	      A system netdev/ifp without native netmap support.
 *
 * 	(the decision about native/non native support is taken in
 * 	 netmap_get_hw_na(), called by netmap_get_na())
 *
 * 	- netmap_vp_adapter		[netmap_vale.c]
 * 	      Returned by netmap_get_bdg_na().
 * 	      This is a persistent or ephemeral VALE port. Ephemeral ports
 * 	      are created on the fly if they don't already exist, and are
 * 	      always attached to a bridge.
 * 	      Persistent VALE ports must be created separately, and then
 * 	      attached like normal NICs. The NIOCREGIF we are examining
 * 	      will find them only if they had previously been created and
 * 	      attached (see VALE_CTL below).
 *
 * 	- netmap_pipe_adapter		[netmap_pipe.c]
 * 	      Returned by netmap_get_pipe_na().
 * 	      Both pipe ends are created, if they didn't already exist.
 *
 * 	- netmap_monitor_adapter	[netmap_monitor.c]
 * 	      Returned by netmap_get_monitor_na().
 * 	      If successful, the nm_sync callbacks of the monitored adapter
 * 	      will be intercepted by the returned monitor.
 *
 * 	- netmap_bwrap_adapter		[netmap_vale.c]
 * 	      Cannot be obtained in this way, see VALE_CTL below
 *
 * 	os-specific:
 * 	    linux: we first go through linux_netmap_ioctl() to
 * 	           adapt the FreeBSD interface to the linux one.
 *
 * > 3. on each descriptor, the process issues an mmap() request to
 * >    map the shared memory region within the process' address space.
 * >    The list of interesting queues is indicated by a location in
 * >    the shared memory region.
 *
 * 	os-specific:
 * 	    FreeBSD: netmap_mmap_single (netmap_freebsd.c).
 * 	    linux:   linux_netmap_mmap (netmap_linux.c).
 *
 * > 4. using the functions in the netmap(4) userspace API, a process
 * >    can look up the occupation state of a queue, access memory buffers,
 * >    and retrieve received packets or enqueue packets to transmit.
 *
 * 	these actions do not involve the kernel.
 *
 * > 5. using some ioctl()s the process can synchronize the userspace view
 * >    of the queue with the actual status in the kernel. This includes both
 * >    receiving the notification of new packets, and transmitting new
 * >    packets on the output interface.
 *
 * 	These are implemented in netmap_ioctl(), NIOCTXSYNC and NIOCRXSYNC
 * 	cases. They invoke the nm_sync callbacks on the netmap_kring
 * 	structures, as initialized in step 2 and maybe later modified
 * 	by a monitor. Monitors, however, will always call the original
 * 	callback before doing anything else.
 *
 * > 6. select() or poll() can be used to wait for events on individual
 * >    transmit or receive queues (or all queues for a given interface).
 *
 * 	Implemented in netmap_poll(). This will call the same nm_sync()
 * 	callbacks as in step 5 above.
 *
 * 	os-specific:
 * 	    linux: we first go through linux_netmap_poll() to adapt
 * 	           the FreeBSD interface to the linux one.
 *
 *  ---- VALE_CTL -----
 *
 * 	VALE switches are controlled by issuing a NIOCREGIF with a non-null
 * 	nr_cmd in the nmreq structure. These subcommands are handled by
 * 	netmap_bdg_ctl() in netmap_vale.c. Persistent VALE ports are created
 * 	and destroyed by issuing the NETMAP_BDG_NEWIF and NETMAP_BDG_DELIF
 * 	subcommands, respectively.
 *
 * 	Any network interface known to the system (including a persistent VALE
 * 	port) can be attached to a VALE switch by issuing the
 * 	NETMAP_BDG_ATTACH subcommand. After the attachment, persistent VALE ports
 * 	look exactly like ephemeral VALE ports (as created in step 2 above). The
 * 	attachment of other interfaces, instead, requires the creation of a
 * 	netmap_bwrap_adapter. Moreover, the attached interface must be put in
 * 	netmap mode. This may require the creation of a netmap_generic_adapter if
 * 	we have no native support for the interface, or if generic adapters have
 * 	been forced by sysctl.
 *
 * 	Both persistent VALE ports and bwraps are handled by netmap_get_bdg_na(),
 * 	called by nm_bdg_ctl_attach(), and discriminated by the nm_bdg_attach()
 * 	callback. In the case of the bwrap, the callback creates the
 * 	netmap_bwrap_adapter. The initialization of the bwrap is then
 * 	completed by calling netmap_do_regif() on it, in the nm_bdg_ctl()
 * 	callback (netmap_bwrap_bdg_ctl in netmap_vale.c).
 * 	A generic adapter for the wrapped ifp will be created if needed, when
 * 	netmap_get_bdg_na() calls netmap_get_hw_na().
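 *
 * 	For example, a NIC can be attached to a VALE switch with a request
 * 	like the following (a sketch using the public nmreq fields; the
 * 	names "vale0" and "em0" are just examples, error handling omitted):
 *
 * 		struct nmreq req = { .nr_version = NETMAP_API };
 * 		strlcpy(req.nr_name, "vale0:em0", sizeof(req.nr_name));
 * 		req.nr_cmd = NETMAP_BDG_ATTACH;	// handled by netmap_bdg_ctl()
 * 		ioctl(fd, NIOCREGIF, &req);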
 *
 *
 *  ---- DATAPATHS -----
 *
 *              -= SYSTEM DEVICE WITH NATIVE SUPPORT =-
 *
 *    na == NA(ifp) == netmap_hw_adapter created in DEVICE_netmap_attach()
 *
 *    - tx from netmap userspace:
 *       concurrently:
 *           1) ioctl(NIOCTXSYNC)/netmap_poll() in process context
 *                kring->nm_sync() == DEVICE_netmap_txsync()
 *           2) device interrupt handler
 *                na->nm_notify()  == netmap_notify()
 *    - rx from netmap userspace:
 *       concurrently:
 *           1) ioctl(NIOCRXSYNC)/netmap_poll() in process context
 *                kring->nm_sync() == DEVICE_netmap_rxsync()
 *           2) device interrupt handler
 *                na->nm_notify()  == netmap_notify()
 *    - tx from host stack
 *       concurrently:
 *           1) host stack
 *                netmap_transmit()
 *                  na->nm_notify  == netmap_notify()
 *           2) ioctl(NIOCRXSYNC)/netmap_poll() in process context
 *                kring->nm_sync() == netmap_rxsync_from_host_compat
 *                  netmap_rxsync_from_host(na, NULL, NULL)
 *    - tx to host stack
 *       ioctl(NIOCTXSYNC)/netmap_poll() in process context
 *         kring->nm_sync() == netmap_txsync_to_host_compat
 *           netmap_txsync_to_host(na)
 *             NM_SEND_UP()
 *               FreeBSD: na->if_input() == ?? XXX
 *               linux: netif_rx() with NM_MAGIC_PRIORITY_RX
 *
 *
 *              -= SYSTEM DEVICE WITH GENERIC SUPPORT =-
 *
 *              -= SYSTEM DEVICE WITH NATIVE SUPPORT, CONNECTED TO VALE, NO HOST RINGS =-
 *
 *              -= SYSTEM DEVICE WITH NATIVE SUPPORT, CONNECTED TO VALE, WITH HOST RINGS =-
 *
 *              -= SYSTEM DEVICE WITH GENERIC SUPPORT, CONNECTED TO VALE, NO HOST RINGS =-
 *
 *              -= SYSTEM DEVICE WITH GENERIC SUPPORT, CONNECTED TO VALE, WITH HOST RINGS =-
 *
 */
/*
 * OS-specific code that is used only within this file.
 * Other OS-specific code that must be accessed by drivers
 * is present in netmap_kern.h
 */
#if defined(__FreeBSD__)
#include <sys/cdefs.h>		/* prerequisite */
#include <sys/types.h>
#include <sys/errno.h>
#include <sys/param.h>		/* defines used in kernel.h */
#include <sys/kernel.h>		/* types used in module initialization */
#include <sys/conf.h>		/* cdevsw struct, UID, GID */
#include <sys/filio.h>		/* FIONBIO */
#include <sys/sockio.h>
#include <sys/socketvar.h>	/* struct socket */
#include <sys/malloc.h>
#include <sys/poll.h>
#include <sys/rwlock.h>
#include <sys/socket.h>		/* sockaddrs */
#include <sys/selinfo.h>
#include <sys/sysctl.h>
#include <sys/jail.h>
#include <net/vnet.h>
#include <net/if.h>
#include <net/if_var.h>
#include <net/bpf.h>		/* BIOCIMMEDIATE */
#include <machine/bus.h>	/* bus_dmamap_* */
#include <sys/endian.h>
#include <sys/refcount.h>

/* reduce conditional code */
// linux API, use for the knlist in FreeBSD
/* use a private mutex for the knlist */
#define init_waitqueue_head(x) do {			\
	struct mtx *m = &(x)->m;			\
	mtx_init(m, "nm_kn_lock", NULL, MTX_DEF);	\
	knlist_init_mtx(&(x)->si.si_note, m);		\
    } while (0)

#define OS_selrecord(a, b)	selrecord(a, &((b)->si))
#define OS_selwakeup(a, b)	freebsd_selwakeup(a, b)

#elif defined(linux)

#include "bsd_glue.h"

#elif defined(__APPLE__)

#warning OSX support is only partial
#include "osx_glue.h"

#else

#error	Unsupported platform

#endif /* unsupported */

/*
 * common headers
 */
#include <net/netmap.h>
#include <dev/netmap/netmap_kern.h>
#include <dev/netmap/netmap_mem2.h>

MALLOC_DEFINE(M_NETMAP, "netmap", "Network memory map");

/*
 * The following variables are used by the drivers and replicate
 * fields in the global memory pool. They only refer to buffers
 * used by physical interfaces.
 */
u_int netmap_total_buffers;
u_int netmap_buf_size;
char *netmap_buffer_base;	/* also address of an invalid buffer */

/* user-controlled variables */
int netmap_verbose;

static int netmap_no_timestamp; /* don't timestamp on rxsync */

SYSCTL_NODE(_dev, OID_AUTO, netmap, CTLFLAG_RW, 0, "Netmap args");
SYSCTL_INT(_dev_netmap, OID_AUTO, verbose,
    CTLFLAG_RW, &netmap_verbose, 0, "Verbose mode");
SYSCTL_INT(_dev_netmap, OID_AUTO, no_timestamp,
    CTLFLAG_RW, &netmap_no_timestamp, 0, "no_timestamp");
int netmap_mitigate = 1;
SYSCTL_INT(_dev_netmap, OID_AUTO, mitigate, CTLFLAG_RW, &netmap_mitigate, 0, "");
int netmap_no_pendintr = 1;
SYSCTL_INT(_dev_netmap, OID_AUTO, no_pendintr,
    CTLFLAG_RW, &netmap_no_pendintr, 0, "Always look for new received packets.");
int netmap_txsync_retry = 2;
SYSCTL_INT(_dev_netmap, OID_AUTO, txsync_retry, CTLFLAG_RW,
    &netmap_txsync_retry, 0, "Number of txsync loops in bridge's flush.");

int netmap_adaptive_io = 0;
SYSCTL_INT(_dev_netmap, OID_AUTO, adaptive_io, CTLFLAG_RW,
    &netmap_adaptive_io, 0, "Adaptive I/O on paravirt");

int netmap_flags = 0;	/* debug flags */
int netmap_fwd = 0;	/* force transparent mode */
int netmap_mmap_unreg = 0; /* allow mmap of unregistered fds */

/*
 * netmap_admode selects the netmap mode to use.
 * Invalid values are reset to NETMAP_ADMODE_BEST
 */
enum { NETMAP_ADMODE_BEST = 0,	/* use native, fallback to generic */
	NETMAP_ADMODE_NATIVE,	/* either native or none */
	NETMAP_ADMODE_GENERIC,	/* force generic */
	NETMAP_ADMODE_LAST };
static int netmap_admode = NETMAP_ADMODE_BEST;

int netmap_generic_mit = 100*1000;	/* Generic mitigation interval in nanoseconds. */
int netmap_generic_ringsize = 1024;	/* Generic ringsize. */
int netmap_generic_rings = 1;		/* number of queues in generic. */

SYSCTL_INT(_dev_netmap, OID_AUTO, flags, CTLFLAG_RW, &netmap_flags, 0, "");
SYSCTL_INT(_dev_netmap, OID_AUTO, fwd, CTLFLAG_RW, &netmap_fwd, 0, "");
SYSCTL_INT(_dev_netmap, OID_AUTO, mmap_unreg, CTLFLAG_RW, &netmap_mmap_unreg, 0, "");
SYSCTL_INT(_dev_netmap, OID_AUTO, admode, CTLFLAG_RW, &netmap_admode, 0, "");
SYSCTL_INT(_dev_netmap, OID_AUTO, generic_mit, CTLFLAG_RW, &netmap_generic_mit, 0, "");
SYSCTL_INT(_dev_netmap, OID_AUTO, generic_ringsize, CTLFLAG_RW, &netmap_generic_ringsize, 0, "");
SYSCTL_INT(_dev_netmap, OID_AUTO, generic_rings, CTLFLAG_RW, &netmap_generic_rings, 0, "");
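
/*
 * For example, generic adapters can be forced system-wide with
 * (hypothetical shell session, FreeBSD syntax):
 *
 *	# sysctl dev.netmap.admode=2	# NETMAP_ADMODE_GENERIC
 *	# sysctl dev.netmap.verbose=1
 */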

NMG_LOCK_T	netmap_global_lock;


static void
nm_kr_get(struct netmap_kring *kr)
{
	while (NM_ATOMIC_TEST_AND_SET(&kr->nr_busy))
		tsleep(kr, 0, "NM_KR_GET", 4);
}

/*
 * mark the ring as stopped, and run through the locks
 * to make sure other users get to see it.
 */
static void
netmap_disable_ring(struct netmap_kring *kr)
{
	kr->nkr_stopped = 1;
	nm_kr_get(kr);
	mtx_lock(&kr->q_lock);
	mtx_unlock(&kr->q_lock);
	nm_kr_put(kr);
}

/* stop or enable a single tx ring */
void
netmap_set_txring(struct netmap_adapter *na, u_int ring_id, int stopped)
{
	if (stopped)
		netmap_disable_ring(na->tx_rings + ring_id);
	else
		na->tx_rings[ring_id].nkr_stopped = 0;
	/* notify that the stopped state has changed. This is currently
	 * only used by bwrap to propagate the state to its own krings.
	 * (see netmap_bwrap_intr_notify).
	 */
	na->nm_notify(na, ring_id, NR_TX, NAF_DISABLE_NOTIFY);
}

/* stop or enable a single rx ring */
void
netmap_set_rxring(struct netmap_adapter *na, u_int ring_id, int stopped)
{
	if (stopped)
		netmap_disable_ring(na->rx_rings + ring_id);
	else
		na->rx_rings[ring_id].nkr_stopped = 0;
	/* notify that the stopped state has changed. This is currently
	 * only used by bwrap to propagate the state to its own krings.
	 * (see netmap_bwrap_intr_notify).
	 */
	na->nm_notify(na, ring_id, NR_RX, NAF_DISABLE_NOTIFY);
}

/* stop or enable all the rings of na */
void
netmap_set_all_rings(struct netmap_adapter *na, int stopped)
{
	int i;
	u_int ntx, nrx;

	if (!nm_netmap_on(na))
		return;

	ntx = netmap_real_tx_rings(na);
	nrx = netmap_real_rx_rings(na);

	for (i = 0; i < ntx; i++) {
		netmap_set_txring(na, i, stopped);
	}

	for (i = 0; i < nrx; i++) {
		netmap_set_rxring(na, i, stopped);
	}
}

/*
 * Convenience function used in drivers. Waits for current txsync()s/rxsync()s
 * to finish and prevents any new one from starting. Call this before turning
 * netmap mode off, or before removing the hardware rings (e.g., on module
 * unload). As a rule of thumb for linux drivers, this should be placed near
 * each napi_disable().
 */
void
netmap_disable_all_rings(struct ifnet *ifp)
{
	netmap_set_all_rings(NA(ifp), 1 /* stopped */);
}

/*
 * Convenience function used in drivers. Re-enables rxsync and txsync on the
 * adapter's rings. In linux drivers, this should be placed near each
 * napi_enable().
 */
void
netmap_enable_all_rings(struct ifnet *ifp)
{
	netmap_set_all_rings(NA(ifp), 0 /* enabled */);
}
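
/*
 * Typical driver usage around a hardware reset, as a sketch (the
 * DEVICE_reset() call and the softc are hypothetical):
 *
 *	netmap_disable_all_rings(ifp);	// wait for pending *xsync()s
 *	DEVICE_reset(sc);		// reinitialize the hardware rings
 *	netmap_enable_all_rings(ifp);	// resume netmap operation
 */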

/*
 * generic bound_checking function
 */
u_int
nm_bound_var(u_int *v, u_int dflt, u_int lo, u_int hi, const char *msg)
{
	u_int oldv = *v;
	const char *op = NULL;

	if (dflt < lo)
		dflt = lo;
	if (dflt > hi)
		dflt = hi;
	if (oldv < lo) {
		*v = dflt;
		op = "Bump";
	} else if (oldv > hi) {
		*v = dflt;
		op = "Clamp";
	}
	if (op && msg)
		printf("%s %s to %d (was %d)\n", op, msg, *v, oldv);
	return *v;
}

/*
 * packet-dump function, user-supplied or static buffer.
 * The destination buffer must be at least 30+4*len
 */
const char *
nm_dump_buf(char *p, int len, int lim, char *dst)
{
	static char _dst[8192];
	int i, j, i0;
	static char hex[] = "0123456789abcdef";
	char *o;	/* output position */

#define P_HI(x)	hex[((x) & 0xf0)>>4]
#define P_LO(x)	hex[((x) & 0xf)]
#define P_C(x)	((x) >= 0x20 && (x) <= 0x7e ? (x) : '.')
	if (!dst)
		dst = _dst;
	if (lim <= 0 || lim > len)
		lim = len;
	o = dst;
	sprintf(o, "buf 0x%p len %d lim %d\n", p, len, lim);
	o += strlen(o);
	/* hexdump routine */
	for (i = 0; i < lim; ) {
		sprintf(o, "%5d: ", i);
		o += strlen(o);
		memset(o, ' ', 48);
		i0 = i;
		for (j = 0; j < 16 && i < lim; i++, j++) {
			o[j*3] = P_HI(p[i]);
			o[j*3+1] = P_LO(p[i]);
		}
		i = i0;
		for (j = 0; j < 16 && i < lim; i++, j++)
			o[j + 48] = P_C(p[i]);
		o[j + 48] = '\n';
		o += j + 49;
	}
	*o = '\0';
#undef P_HI
#undef P_LO
#undef P_C
	return dst;
}

/*
 * Fetch configuration from the device, to cope with dynamic
 * reconfigurations after loading the module.
 */
/* call with NMG_LOCK held */
int
netmap_update_config(struct netmap_adapter *na)
{
	u_int txr, txd, rxr, rxd;

	txr = txd = rxr = rxd = 0;
	if (na->nm_config == NULL ||
	    na->nm_config(na, &txr, &txd, &rxr, &rxd)) {
		/* take whatever we had at init time */
		txr = na->num_tx_rings;
		txd = na->num_tx_desc;
		rxr = na->num_rx_rings;
		rxd = na->num_rx_desc;
	}

	if (na->num_tx_rings == txr && na->num_tx_desc == txd &&
	    na->num_rx_rings == rxr && na->num_rx_desc == rxd)
		return 0; /* nothing changed */
	if (netmap_verbose || na->active_fds > 0) {
		D("stored config %s: txring %d x %d, rxring %d x %d",
			na->name,
			na->num_tx_rings, na->num_tx_desc,
			na->num_rx_rings, na->num_rx_desc);
		D("new config %s: txring %d x %d, rxring %d x %d",
			na->name, txr, txd, rxr, rxd);
	}
	if (na->active_fds == 0) {
		D("configuration changed (but fine)");
		na->num_tx_rings = txr;
		na->num_tx_desc = txd;
		na->num_rx_rings = rxr;
		na->num_rx_desc = rxd;
		return 0;
	}
	D("configuration changed while active, this is bad...");
	return 1;
}

/* kring->nm_sync callback for the host tx ring */
static int
netmap_txsync_to_host_compat(struct netmap_kring *kring, int flags)
{
	(void)flags; /* unused */
	netmap_txsync_to_host(kring->na);
	return 0;
}

/* kring->nm_sync callback for the host rx ring */
static int
netmap_rxsync_from_host_compat(struct netmap_kring *kring, int flags)
{
	(void)flags; /* unused */
	netmap_rxsync_from_host(kring->na, NULL, NULL);
	return 0;
}

/* create the krings array and initialize the fields common to all adapters.
 * The array layout is this:
 *
 *                    +----------+
 * na->tx_rings ----->|          | \
 *                    |          |  } na->num_tx_rings
 *                    |          | /
 *                    +----------+
 *                    |          |    host tx kring
 * na->rx_rings ----> +----------+
 *                    |          | \
 *                    |          |  } na->num_rx_rings
 *                    |          | /
 *                    +----------+
 *                    |          |    host rx kring
 *                    +----------+
 * na->tailroom ----->|          | \
 *                    |          |  } tailroom bytes
 *                    |          | /
 *                    +----------+
 *
 * Note: for compatibility, host krings are created even when not needed.
 * The tailroom space is currently used by vale ports for allocating leases.
 */
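/*
 * For example (a sketch consistent with the layout above), the host
 * krings are reached by indexing one past the hardware rings:
 *
 *	struct netmap_kring *host_tx = &na->tx_rings[na->num_tx_rings];
 *	struct netmap_kring *host_rx = &na->rx_rings[na->num_rx_rings];
 */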
/* call with NMG_LOCK held */
int
netmap_krings_create(struct netmap_adapter *na, u_int tailroom)
{
	u_int i, len, ndesc;
	struct netmap_kring *kring;
	u_int ntx, nrx;

	/* account for the (possibly fake) host rings */
	ntx = na->num_tx_rings + 1;
	nrx = na->num_rx_rings + 1;

	len = (ntx + nrx) * sizeof(struct netmap_kring) + tailroom;

	na->tx_rings = malloc((size_t)len, M_DEVBUF, M_NOWAIT | M_ZERO);
	if (na->tx_rings == NULL) {
		D("Cannot allocate krings");
		return ENOMEM;
	}
	na->rx_rings = na->tx_rings + ntx;

	/*
	 * All fields in krings are 0 except the ones initialized below.
	 * but better be explicit on important kring fields.
	 */
	ndesc = na->num_tx_desc;
	for (i = 0; i < ntx; i++) { /* Transmit rings */
		kring = &na->tx_rings[i];
		bzero(kring, sizeof(*kring));
		kring->na = na;
		kring->ring_id = i;
		kring->nkr_num_slots = ndesc;
		if (i < na->num_tx_rings) {
			kring->nm_sync = na->nm_txsync;
		} else if (i == na->num_tx_rings) {
			kring->nm_sync = netmap_txsync_to_host_compat;
		}
		/*
		 * IMPORTANT: Always keep one slot empty.
		 */
		kring->rhead = kring->rcur = kring->nr_hwcur = 0;
		kring->rtail = kring->nr_hwtail = ndesc - 1;
		snprintf(kring->name, sizeof(kring->name) - 1, "%s TX%d", na->name, i);
		ND("ktx %s h %d c %d t %d",
			kring->name, kring->rhead, kring->rcur, kring->rtail);
		mtx_init(&kring->q_lock, "nm_txq_lock", NULL, MTX_DEF);
		init_waitqueue_head(&kring->si);
	}

	ndesc = na->num_rx_desc;
	for (i = 0; i < nrx; i++) { /* Receive rings */
		kring = &na->rx_rings[i];
		bzero(kring, sizeof(*kring));
		kring->na = na;
		kring->ring_id = i;
		kring->nkr_num_slots = ndesc;
		if (i < na->num_rx_rings) {
			kring->nm_sync = na->nm_rxsync;
		} else if (i == na->num_rx_rings) {
			kring->nm_sync = netmap_rxsync_from_host_compat;
		}
		kring->rhead = kring->rcur = kring->nr_hwcur = 0;
		kring->rtail = kring->nr_hwtail = 0;
		snprintf(kring->name, sizeof(kring->name) - 1, "%s RX%d", na->name, i);
		ND("krx %s h %d c %d t %d",
			kring->name, kring->rhead, kring->rcur, kring->rtail);
		mtx_init(&kring->q_lock, "nm_rxq_lock", NULL, MTX_DEF);
		init_waitqueue_head(&kring->si);
	}
	init_waitqueue_head(&na->tx_si);
	init_waitqueue_head(&na->rx_si);

	na->tailroom = na->rx_rings + nrx;

	return 0;
}

#ifdef __FreeBSD__
void
netmap_knlist_destroy(NM_SELINFO_T *si)
{
	/* XXX kqueue(9) needed; these will mirror knlist_init. */
	knlist_delete(&si->si.si_note, curthread, 0 /* not locked */ );
	knlist_destroy(&si->si.si_note);
	/* now we don't need the mutex anymore */
	mtx_destroy(&si->m);
}
#endif /* __FreeBSD__ */

/* undo the actions performed by netmap_krings_create */
/* call with NMG_LOCK held */
void
netmap_krings_delete(struct netmap_adapter *na)
{
	struct netmap_kring *kring = na->tx_rings;

	/* we rely on the krings layout described above */
	for ( ; kring != na->tailroom; kring++) {
		mtx_destroy(&kring->q_lock);
		netmap_knlist_destroy(&kring->si);
	}
	free(na->tx_rings, M_DEVBUF);
	na->tx_rings = na->rx_rings = na->tailroom = NULL;
}

/*
 * Destructor for NIC ports. They also have an mbuf queue
 * on the rings connected to the host so we need to purge
 * them first.
 */
/* call with NMG_LOCK held */
void
netmap_hw_krings_delete(struct netmap_adapter *na)
{
	struct mbq *q = &na->rx_rings[na->num_rx_rings].rx_queue;

	ND("destroy sw mbq with len %d", mbq_len(q));
	mbq_purge(q);
	mbq_safe_destroy(q);
	netmap_krings_delete(na);
}

/* create a new netmap_if for a newly registered fd.
 * If this is the first registration of the adapter,
 * also create the netmap rings and their in-kernel view,
 * the netmap krings.
 */
/* call with NMG_LOCK held */
static struct netmap_if*
netmap_if_new(struct netmap_adapter *na)
{
	struct netmap_if *nifp;

	if (netmap_update_config(na)) {
		/* configuration mismatch, report and fail */
		return NULL;
	}

	if (na->active_fds)	/* already registered */
		goto final;

	/* create and init the krings arrays.
	 * Depending on the adapter, this may also create
	 * the netmap rings themselves
	 */
	if (na->nm_krings_create(na))
		goto cleanup;

	/* create all missing netmap rings */
	if (netmap_mem_rings_create(na))
		goto cleanup;

final:

	/* in all cases, create a new netmap if */
	nifp = netmap_mem_if_new(na);
	if (nifp == NULL)
		goto cleanup;

	return (nifp);

cleanup:

	if (na->active_fds == 0) {
		netmap_mem_rings_delete(na);
		na->nm_krings_delete(na);
	}

	return NULL;
}

/* grab a reference to the memory allocator, if we don't have one already. The
 * reference is taken from the netmap_adapter registered with the priv.
 */
/* call with NMG_LOCK held */
static int
netmap_get_memory_locked(struct netmap_priv_d* p)
{
	struct netmap_mem_d *nmd;
	int error = 0;

	if (p->np_na == NULL) {
		if (!netmap_mmap_unreg)
			return ENODEV;
		/* for compatibility with older versions of the API
		 * we use the global allocator when no interface has been
		 * registered
		 */
		nmd = &nm_mem;
	} else {
		nmd = p->np_na->nm_mem;
	}
	if (p->np_mref == NULL) {
		error = netmap_mem_finalize(nmd, p->np_na);
		if (!error)
			p->np_mref = nmd;
	} else if (p->np_mref != nmd) {
		/* a virtual port has been registered, but previous
		 * syscalls already used the global allocator.
		 * We cannot continue
		 */
		error = ENODEV;
	}
	return error;
}

/* call with NMG_LOCK *not* held */
static int
netmap_get_memory(struct netmap_priv_d* p)
{
	int error;

	NMG_LOCK();
	error = netmap_get_memory_locked(p);
	NMG_UNLOCK();
	return error;
}

/* call with NMG_LOCK held */
static int
netmap_have_memory_locked(struct netmap_priv_d* p)
{
	return p->np_mref != NULL;
}

/* call with NMG_LOCK held */
static void
netmap_drop_memory_locked(struct netmap_priv_d* p)
{
	if (p->np_mref) {
		netmap_mem_deref(p->np_mref, p->np_na);
		p->np_mref = NULL;
	}
}

/*
 * Call nm_register(ifp,0) to stop netmap mode on the interface and
 * revert to normal operation.
 * The second argument is the nifp to work on. In some cases it is
 * not attached yet to the netmap_priv_d so we need to pass it as
 * a separate argument.
 */
/* call with NMG_LOCK held */
static void
netmap_do_unregif(struct netmap_priv_d *priv, struct netmap_if *nifp)
{
	struct netmap_adapter *na = priv->np_na;

	NMG_LOCK_ASSERT();
	na->active_fds--;
	if (na->active_fds <= 0) {	/* last instance */

		if (netmap_verbose)
			D("deleting last instance for %s", na->name);
		/*
		 * (TO CHECK) This function is only called
		 * when the last reference to this file descriptor goes
		 * away. This means we cannot have any pending poll()
		 * or interrupt routine operating on the structure.
		 * XXX The file may be closed in a thread while
		 * another thread is using it.
		 * Linux keeps the file opened until the last reference
		 * by any outstanding ioctl/poll or mmap is gone.
		 * FreeBSD does not track mmap()s (but we do) and
		 * wakes up any sleeping poll(). Need to check what
		 * happens if the close() occurs while a concurrent
		 * syscall is running.
		 */
		na->nm_register(na, 0); /* off, clear flags */
		/* Wake up any sleeping threads. netmap_poll will
		 * then return POLLERR
		 * XXX The wake up now must happen during *_down(), when
		 * we order all activities to stop. -gl
		 */
		netmap_knlist_destroy(&na->tx_si);
		netmap_knlist_destroy(&na->rx_si);

		/* delete rings and buffers */
		netmap_mem_rings_delete(na);
		na->nm_krings_delete(na);
	}
	/* delete the nifp */
	netmap_mem_if_delete(na, nifp);
}

/* call with NMG_LOCK held */
static __inline int
nm_tx_si_user(struct netmap_priv_d *priv)
{
	return (priv->np_na != NULL &&
		(priv->np_txqlast - priv->np_txqfirst > 1));
}

/* call with NMG_LOCK held */
static __inline int
nm_rx_si_user(struct netmap_priv_d *priv)
{
	return (priv->np_na != NULL &&
		(priv->np_rxqlast - priv->np_rxqfirst > 1));
}

/*
 * Destructor of the netmap_priv_d, called when the fd has
 * no active open() and mmap(). Also called in error paths.
 *
 * returns 1 if this is the last instance and we can free priv
 */
/* call with NMG_LOCK held */
static int
netmap_dtor_locked(struct netmap_priv_d *priv)
{
	struct netmap_adapter *na = priv->np_na;

#ifdef __FreeBSD__
	/*
	 * np_refcount is the number of active mmaps on
	 * this file descriptor
	 */
	if (--priv->np_refcount > 0) {
		return 0;
	}
#endif /* __FreeBSD__ */
	if (!na) {
		return 1; //XXX is it correct?
	}
	netmap_do_unregif(priv, priv->np_nifp);
	priv->np_nifp = NULL;
	netmap_drop_memory_locked(priv);
	if (priv->np_na) {
		if (nm_tx_si_user(priv))
			na->tx_si_users--;
		if (nm_rx_si_user(priv))
			na->rx_si_users--;
		netmap_adapter_put(na);
		priv->np_na = NULL;
	}
	return 1;
}

/* call with NMG_LOCK *not* held */
void
netmap_dtor(void *data)
{
	struct netmap_priv_d *priv = data;
	int last_instance;

	NMG_LOCK();
	last_instance = netmap_dtor_locked(priv);
	NMG_UNLOCK();
	if (last_instance) {
		bzero(priv, sizeof(*priv));	/* for safety */
		free(priv, M_DEVBUF);
	}
}

/*
 * Handlers for synchronization of the queues from/to the host.
 * Netmap has two operating modes:
 * - in the default mode, the rings connected to the host stack are
 *   just another ring pair managed by userspace;
 * - in transparent mode (XXX to be defined) incoming packets
 *   (from the host or the NIC) are marked as NS_FORWARD upon
 *   arrival, and the user application has a chance to reset the
 *   flag for packets that should be dropped.
 *   On the RXSYNC or poll(), packets in RX rings between
 *   kring->nr_hwcur and ring->cur with NS_FORWARD still set are moved
 *   to the other side.
 * The transfer NIC --> host is relatively easy, just encapsulate
 * into mbufs and we are done. The host --> NIC side is slightly
 * harder because there might not be room in the tx ring so it
 * might take a while before releasing the buffer.
 */

/*
 * pass a chain of buffers to the host stack as coming from 'dst'
 * We do not need to lock because the queue is private.
 */
static void
netmap_send_up(struct ifnet *dst, struct mbq *q)
{
	struct mbuf *m;

	/* send packets up, outside the lock */
	while ((m = mbq_dequeue(q)) != NULL) {
		if (netmap_verbose & NM_VERB_HOST)
			D("sending up pkt %p size %d", m, MBUF_LEN(m));
		NM_SEND_UP(dst, m);
	}
	mbq_destroy(q);
}

/*
 * put a copy of the buffers marked NS_FORWARD into an mbuf chain.
 * Take packets from hwcur to ring->head marked NS_FORWARD (or forced)
 * and pass them up. Drop remaining packets in the unlikely event
 * of an mbuf shortage.
 */
static void
netmap_grab_packets(struct netmap_kring *kring, struct mbq *q, int force)
{
	u_int const lim = kring->nkr_num_slots - 1;
	u_int const head = kring->ring->head;
	u_int n;
	struct netmap_adapter *na = kring->na;

	for (n = kring->nr_hwcur; n != head; n = nm_next(n, lim)) {
		struct mbuf *m;
		struct netmap_slot *slot = &kring->ring->slot[n];

		if ((slot->flags & NS_FORWARD) == 0 && !force)
			continue;
		if (slot->len < 14 || slot->len > NETMAP_BUF_SIZE(na)) {
			RD(5, "bad pkt at %d len %d", n, slot->len);
			continue;
		}
		slot->flags &= ~NS_FORWARD; // XXX needed ?
		/* XXX TODO: adapt to the case of a multisegment packet */
		m = m_devget(NMB(na, slot), slot->len, 0, na->ifp, NULL);

		if (m == NULL)
			break;
		mbq_enqueue(q, m);
	}
}

/*
 * Send to the NIC rings packets marked NS_FORWARD between
 * kring->nr_hwcur and kring->rhead
 * Called under kring->rx_queue.lock on the sw rx ring,
 */
static u_int
netmap_sw_to_nic(struct netmap_adapter *na)
{
	struct netmap_kring *kring = &na->rx_rings[na->num_rx_rings];
	struct netmap_slot *rxslot = kring->ring->slot;
	u_int i, rxcur = kring->nr_hwcur;
	u_int const head = kring->rhead;
	u_int const src_lim = kring->nkr_num_slots - 1;
	u_int sent = 0;

	/* scan rings to find space, then fill as much as possible */
	for (i = 0; i < na->num_tx_rings; i++) {
		struct netmap_kring *kdst = &na->tx_rings[i];
		struct netmap_ring *rdst = kdst->ring;
		u_int const dst_lim = kdst->nkr_num_slots - 1;

		/* XXX do we trust ring or kring->rcur,rtail ? */
		for (; rxcur != head && !nm_ring_empty(rdst);
		     rxcur = nm_next(rxcur, src_lim) ) {
			struct netmap_slot *src, *dst, tmp;
			u_int dst_cur = rdst->cur;

			src = &rxslot[rxcur];
			if ((src->flags & NS_FORWARD) == 0 && !netmap_fwd)
				continue;

			sent++;

			dst = &rdst->slot[dst_cur];

			tmp = *src;

			src->buf_idx = dst->buf_idx;
			src->flags = NS_BUF_CHANGED;

			dst->buf_idx = tmp.buf_idx;
			dst->len = tmp.len;
			dst->flags = NS_BUF_CHANGED;

			rdst->cur = nm_next(dst_cur, dst_lim);
		}
		/* if (sent) XXX txsync ? */
	}
	return sent;
}

/*
 * netmap_txsync_to_host() passes packets up. We are called from a
 * system call in user process context, and the only contention
 * can be among multiple user threads erroneously calling
 * this routine concurrently.
 */
void
netmap_txsync_to_host(struct netmap_adapter *na)
{
	struct netmap_kring *kring = &na->tx_rings[na->num_tx_rings];
	struct netmap_ring *ring = kring->ring;
	u_int const lim = kring->nkr_num_slots - 1;
	u_int const head = kring->rhead;
	struct mbq q;

	/* Take packets from hwcur to head and pass them up.
	 * force head = cur since netmap_grab_packets() stops at head
	 * In case of no buffers we give up. At the end of the loop,
	 * the queue is drained in all cases.
	 */
	mbq_init(&q);
	ring->cur = head;
	netmap_grab_packets(kring, &q, 1 /* force */);
	ND("have %d pkts in queue", mbq_len(&q));
	kring->nr_hwcur = head;
	kring->nr_hwtail = head + lim;
	if (kring->nr_hwtail > lim)
		kring->nr_hwtail -= lim + 1;
	nm_txsync_finalize(kring);

	netmap_send_up(na->ifp, &q);
}

/*
 * rxsync backend for packets coming from the host stack.
 * They have been put in kring->rx_queue by netmap_transmit().
 * We protect access to the kring using kring->rx_queue.lock
 *
 * This routine also does the selrecord if called from the poll handler
 * (we know because td != NULL).
 *
 * NOTE: on linux, selrecord() is defined as a macro and uses pwait
 * as an additional hidden argument.
 * returns the number of packets delivered to tx queues in
 * transparent mode, or a negative value if error
 */
int
netmap_rxsync_from_host(struct netmap_adapter *na, struct thread *td, void *pwait)
{
	struct netmap_kring *kring = &na->rx_rings[na->num_rx_rings];
	struct netmap_ring *ring = kring->ring;
	u_int nm_i, n;
	u_int const lim = kring->nkr_num_slots - 1;
	u_int const head = kring->rhead;
	int ret = 0;
	struct mbq *q = &kring->rx_queue;

	(void)pwait;	/* disable unused warnings */
	(void)td;

	mbq_lock(q);

	/* First part: import newly received packets */
	n = mbq_len(q);
	if (n) { /* grab packets from the queue */
		struct mbuf *m;
		uint32_t stop_i;

		nm_i = kring->nr_hwtail;
		stop_i = nm_prev(nm_i, lim);
		while ( nm_i != stop_i && (m = mbq_dequeue(q)) != NULL ) {
			int len = MBUF_LEN(m);
			struct netmap_slot *slot = &ring->slot[nm_i];

			m_copydata(m, 0, len, NMB(na, slot));
			ND("nm %d len %d", nm_i, len);
			if (netmap_verbose)
				D("%s", nm_dump_buf(NMB(na, slot), len, 128, NULL));

			slot->len = len;
			slot->flags = kring->nkr_slot_flags;
			nm_i = nm_next(nm_i, lim);
			m_freem(m);
		}
		kring->nr_hwtail = nm_i;
	}

	/*
	 * Second part: skip past packets that userspace has released.
	 */
	nm_i = kring->nr_hwcur;
	if (nm_i != head) { /* something was released */
		if (netmap_fwd || kring->ring->flags & NR_FORWARD)
			ret = netmap_sw_to_nic(na);
		kring->nr_hwcur = head;
	}

	nm_rxsync_finalize(kring);

	/* access copies of cur,tail in the kring */
	if (kring->rcur == kring->rtail && td) /* no bufs available */
		OS_selrecord(td, &kring->si);

	mbq_unlock(q);
	return ret;
}

/* Get a netmap adapter for the port.
 *
 * If it is possible to satisfy the request, return 0
 * with *na containing the netmap adapter found.
 * Otherwise return an error code, with *na containing NULL.
 *
 * When the port is attached to a bridge, we always return
 * EBUSY.
 * Otherwise, if the port is already bound to a file descriptor,
 * then we unconditionally return the existing adapter into *na.
 * In all the other cases, we return (into *na) either native,
 * generic or NULL, according to the following table:
 *
 *                                      native_support
 * active_fds   dev.netmap.admode         YES     NO
 * -------------------------------------------------------
 *    >0              *                 NA(ifp) NA(ifp)
 *
 *     0        NETMAP_ADMODE_BEST      NATIVE  GENERIC
 *     0        NETMAP_ADMODE_NATIVE    NATIVE   NULL
 *     0        NETMAP_ADMODE_GENERIC   GENERIC GENERIC
 *
 */
int
netmap_get_hw_na(struct ifnet *ifp, struct netmap_adapter **na)
{
	/* generic support */
	int i = netmap_admode;	/* Take a snapshot. */
	int error = 0;
	struct netmap_adapter *prev_na;
	struct netmap_generic_adapter *gna;

	*na = NULL; /* default */

	/* reset in case of invalid value */
	if (i < NETMAP_ADMODE_BEST || i >= NETMAP_ADMODE_LAST)
		i = netmap_admode = NETMAP_ADMODE_BEST;

	if (NETMAP_CAPABLE(ifp)) {
		prev_na = NA(ifp);
		/* If an adapter already exists, return it if
		 * there are active file descriptors or if
		 * netmap is not forced to use generic
		 * adapters.
		 */
		if (NETMAP_OWNED_BY_ANY(prev_na)
			|| i != NETMAP_ADMODE_GENERIC
			|| prev_na->na_flags & NAF_FORCE_NATIVE
#ifdef WITH_PIPES
			/* ugly, but we cannot allow an adapter switch
			 * if some pipe is referring to this one
			 */
			|| prev_na->na_next_pipe > 0
#endif
		) {
			*na = prev_na;
			return 0;
		}
	}

	/* If there isn't native support and netmap is not allowed
	 * to use generic adapters, we cannot satisfy the request.
	 */
	if (!NETMAP_CAPABLE(ifp) && i == NETMAP_ADMODE_NATIVE)
		return EOPNOTSUPP;

	/* Otherwise, create a generic adapter and return it,
	 * saving the previously used netmap adapter, if any.
	 *
	 * Note that here 'prev_na', if not NULL, MUST be a
	 * native adapter, and CANNOT be a generic one. This is
	 * true because generic adapters are created on demand, and
	 * destroyed when not used anymore. Therefore, if the adapter
	 * currently attached to an interface 'ifp' is generic, it
	 * must be that
	 * (NA(ifp)->active_fds > 0 || NETMAP_OWNED_BY_KERN(NA(ifp))).
	 * Consequently, if NA(ifp) is generic, we will enter one of
	 * the branches above. This ensures that we never override
	 * a generic adapter with another generic adapter.
	 */
	prev_na = NA(ifp);
	error = generic_netmap_attach(ifp);
	if (error)
		return error;

	*na = NA(ifp);
	gna = (struct netmap_generic_adapter*)NA(ifp);
	gna->prev = prev_na; /* save old na */
	if (prev_na != NULL) {
		ifunit_ref(ifp->if_xname);
		// XXX add a refcount ?
		netmap_adapter_get(prev_na);
	}
	ND("Created generic NA %p (prev %p)", gna, gna->prev);

	return 0;
}

/*
 * MUST BE CALLED UNDER NMG_LOCK()
 *
 * Get a refcounted reference to a netmap adapter attached
 * to the interface specified by nmr.
 * This is always called in the execution of an ioctl().
 *
 * Return ENXIO if the interface specified by the request does
 * not exist, ENOTSUP if netmap is not supported by the interface,
 * EBUSY if the interface is already attached to a bridge,
 * EINVAL if parameters are invalid, ENOMEM if needed resources
 * could not be allocated.
 * If successful, hold a reference to the netmap adapter.
 *
 * No reference is kept on the real interface, which may then
 * disappear at any time.
 */
int
netmap_get_na(struct nmreq *nmr, struct netmap_adapter **na, int create)
{
	struct ifnet *ifp = NULL;
	int error = 0;
	struct netmap_adapter *ret = NULL;

	*na = NULL;	/* default return value */

	NMG_LOCK_ASSERT();

	/* we cascade through all possible types of netmap adapter.
	 * All netmap_get_*_na() functions return an error and an na,
	 * with the following combinations:
	 *
	 * error    na
	 *   0	   NULL		type doesn't match
	 *  !0	   NULL		type matches, but na creation/lookup failed
	 *   0	  !NULL		type matches and na created/found
	 *  !0	  !NULL		impossible
	 */

	/* try to see if this is a monitor port */
	error = netmap_get_monitor_na(nmr, na, create);
	if (error || *na != NULL)
		return error;

	/* try to see if this is a pipe port */
	error = netmap_get_pipe_na(nmr, na, create);
	if (error || *na != NULL)
		return error;

	/* try to see if this is a bridge port */
	error = netmap_get_bdg_na(nmr, na, create);
	if (error)
		return error;

	if (*na != NULL) /* valid match in netmap_get_bdg_na() */
		goto pipes;

	/*
	 * This must be a hardware na, lookup the name in the system.
	 * Note that by hardware we actually mean "it shows up in ifconfig".
	 * This may still be a tap, a veth/epair, or even a
	 * persistent VALE port.
	 */
	ifp = ifunit_ref(nmr->nr_name);
	if (ifp == NULL) {
		return ENXIO;
	}

	error = netmap_get_hw_na(ifp, &ret);
	if (error)
		goto out;

	*na = ret;
	netmap_adapter_get(ret);

pipes:
	/*
	 * If we are opening a pipe whose parent was not in netmap mode,
	 * we have to allocate the pipe array now.
	 * XXX get rid of this clumsiness (2014-03-15)
	 */
	error = netmap_pipe_alloc(*na, nmr);

out:
	if (error && ret != NULL)
		netmap_adapter_put(ret);

	if (ifp)
		if_rele(ifp); /* allow live unloading of drivers modules */

	return error;
}

/*
 * validate parameters on entry for *_txsync()
 * Returns ring->cur if ok, or something >= kring->nkr_num_slots
 * in case of error.
 *
 * rhead, rcur and rtail=hwtail are stored from previous round.
 * hwcur is the next packet to send to the ring.
 *
 * We want
 *    hwcur <= *rhead <= head <= cur <= tail = *rtail <= hwtail
 *
 * hwcur, rhead, rtail and hwtail are reliable
 */
u_int
nm_txsync_prologue(struct netmap_kring *kring)
{
	struct netmap_ring *ring = kring->ring;
	u_int head = ring->head; /* read only once */
	u_int cur = ring->cur; /* read only once */
	u_int n = kring->nkr_num_slots;

	ND(5, "%s kcur %d ktail %d head %d cur %d tail %d",
		kring->name,
		kring->nr_hwcur, kring->nr_hwtail,
		ring->head, ring->cur, ring->tail);
#if 1 /* kernel sanity checks; but we can trust the kring. */
	if (kring->nr_hwcur >= n || kring->rhead >= n ||
	    kring->rtail >= n || kring->nr_hwtail >= n)
		goto error;
#endif /* kernel sanity checks */
	/*
	 * user sanity checks. We only use 'cur',
	 * A, B, ... are possible positions for cur:
	 *
	 *  0    A  cur   B  tail  C  n-1
	 *  0    D  tail  E  cur   F  n-1
	 *
	 * B, F, D are valid. A, C, E are wrong
	 */
	if (kring->rtail >= kring->rhead) {
		/* want rhead <= head <= rtail */
		if (head < kring->rhead || head > kring->rtail)
			goto error;
		/* and also head <= cur <= rtail */
		if (cur < head || cur > kring->rtail)
			goto error;
	} else { /* here rtail < rhead */
		/* we need head outside rtail .. rhead */
		if (head > kring->rtail && head < kring->rhead)
			goto error;

		/* two cases now: head <= rtail or head >= rhead  */
		if (head <= kring->rtail) {
			/* want head <= cur <= rtail */
			if (cur < head || cur > kring->rtail)
				goto error;
		} else { /* head >= rhead */
			/* cur must be outside rtail..head */
			if (cur > kring->rtail && cur < head)
				goto error;
		}
	}
	if (ring->tail != kring->rtail) {
		RD(5, "tail overwritten was %d need %d",
			ring->tail, kring->rtail);
		ring->tail = kring->rtail;
	}
	kring->rhead = head;
	kring->rcur = cur;
	return head;

error:
	RD(5, "%s kring error: hwcur %d rcur %d hwtail %d cur %d tail %d",
		kring->name,
		kring->nr_hwcur,
		kring->rcur, kring->nr_hwtail,
		cur, ring->tail);
	return n;
}

/*
 * validate parameters on entry for *_rxsync()
 * Returns ring->head if ok, kring->nkr_num_slots on error.
 *
 * For a valid configuration,
 * hwcur <= head <= cur <= tail <= hwtail
 *
 * We only consider head and cur.
 * hwcur and hwtail are reliable.
 */
u_int
nm_rxsync_prologue(struct netmap_kring *kring)
{
	struct netmap_ring *ring = kring->ring;
	uint32_t const n = kring->nkr_num_slots;
	uint32_t head, cur;

	ND("%s kc %d kt %d h %d c %d t %d",
		kring->name,
		kring->nr_hwcur, kring->nr_hwtail,
		ring->head, ring->cur, ring->tail);
	/*
	 * Before storing the new values, we should check they do not
	 * move backwards. However:
	 * - head is not an issue because the previous value is hwcur;
	 * - cur could in principle go back, however it does not matter
	 *   because we are processing a brand new rxsync()
	 */
	cur = kring->rcur = ring->cur;	/* read only once */
	head = kring->rhead = ring->head;	/* read only once */
#if 1 /* kernel sanity checks */
	if (kring->nr_hwcur >= n || kring->nr_hwtail >= n)
		goto error;
#endif /* kernel sanity checks */
	/* user sanity checks */
	if (kring->nr_hwtail >= kring->nr_hwcur) {
		/* want hwcur <= rhead <= hwtail */
		if (head < kring->nr_hwcur || head > kring->nr_hwtail)
			goto error;
		/* and also rhead <= rcur <= hwtail */
		if (cur < head || cur > kring->nr_hwtail)
			goto error;
	} else {
		/* we need rhead outside hwtail..hwcur */
		if (head < kring->nr_hwcur && head > kring->nr_hwtail)
			goto error;
		/* two cases now: head <= hwtail or head >= hwcur  */
		if (head <= kring->nr_hwtail) {
			/* want head <= cur <= hwtail */
			if (cur < head || cur > kring->nr_hwtail)
				goto error;
		} else {
			/* cur must be outside hwtail..head */
			if (cur < head && cur > kring->nr_hwtail)
				goto error;
		}
	}
	if (ring->tail != kring->rtail) {
		RD(5, "%s tail overwritten was %d need %d",
			kring->name,
			ring->tail, kring->rtail);
		ring->tail = kring->rtail;
	}
	return head;

error:
	RD(5, "kring error: hwcur %d rcur %d hwtail %d head %d cur %d tail %d",
		kring->nr_hwcur,
		kring->rcur, kring->nr_hwtail,
		kring->rhead, kring->rcur, ring->tail);
	return n;
}

/*
 * Error routine called when txsync/rxsync detects an error.
 * Can't do much more than resetting head = cur = hwcur, tail = hwtail
 * Return 1 on reinit.
 *
 * This routine is only called by the upper half of the kernel.
 * It only reads hwcur (which is changed only by the upper half, too)
 * and hwtail (which may be changed by the lower half, but only on
 * a tx ring and only to increase it, so any error will be recovered
 * on the next call). For the above, we don't strictly need to call
 * it under lock.
 */
int
netmap_ring_reinit(struct netmap_kring *kring)
{
	struct netmap_ring *ring = kring->ring;
	u_int i, lim = kring->nkr_num_slots - 1;
	int errors = 0;

	// XXX KASSERT nm_kr_tryget
	RD(10, "called for %s", kring->name);
	// XXX probably wrong to trust userspace
	kring->rhead = ring->head;
	kring->rcur  = ring->cur;
	kring->rtail = ring->tail;

	if (ring->cur > lim)
		errors++;
	if (ring->head > lim)
		errors++;
	if (ring->tail > lim)
		errors++;
	for (i = 0; i <= lim; i++) {
		u_int idx = ring->slot[i].buf_idx;
		u_int len = ring->slot[i].len;
		if (idx < 2 || idx >= netmap_total_buffers) {
			RD(5, "bad index at slot %d idx %d len %d ", i, idx, len);
			ring->slot[i].buf_idx = 0;
			ring->slot[i].len = 0;
		} else if (len > NETMAP_BUF_SIZE(kring->na)) {
			ring->slot[i].len = 0;
			RD(5, "bad len at slot %d idx %d len %d", i, idx, len);
		}
	}
	if (errors) {
		RD(10, "total %d errors", errors);
		RD(10, "%s reinit, cur %d -> %d tail %d -> %d",
			kring->name,
			ring->cur, kring->nr_hwcur,
			ring->tail, kring->nr_hwtail);
		ring->head = kring->rhead = kring->nr_hwcur;
		ring->cur  = kring->rcur  = kring->nr_hwcur;
		ring->tail = kring->rtail = kring->nr_hwtail;
	}
	return (errors ? 1 : 0);
}

/* interpret the ringid and flags fields of an nmreq, by translating them
 * into a pair of intervals of ring indices:
 *
 * [priv->np_txqfirst, priv->np_txqlast) and
 * [priv->np_rxqfirst, priv->np_rxqlast)
 *
 */
int
netmap_interp_ringid(struct netmap_priv_d *priv, uint16_t ringid, uint32_t flags)
{
	struct netmap_adapter *na = priv->np_na;
	u_int j, i = ringid & NETMAP_RING_MASK;
	u_int reg = flags & NR_REG_MASK;

	if (reg == NR_REG_DEFAULT) {
		/* convert from old ringid to flags */
		if (ringid & NETMAP_SW_RING) {
			reg = NR_REG_SW;
		} else if (ringid & NETMAP_HW_RING) {
			reg = NR_REG_ONE_NIC;
		} else {
			reg = NR_REG_ALL_NIC;
		}
		D("deprecated API, old ringid 0x%x -> ringid %x reg %d", ringid, i, reg);
	}
	switch (reg) {
	case NR_REG_ALL_NIC:
	case NR_REG_PIPE_MASTER:
	case NR_REG_PIPE_SLAVE:
		priv->np_txqfirst = 0;
		priv->np_txqlast = na->num_tx_rings;
		priv->np_rxqfirst = 0;
		priv->np_rxqlast = na->num_rx_rings;
		ND("%s %d %d", "ALL/PIPE",
			priv->np_rxqfirst, priv->np_rxqlast);
		break;
	case NR_REG_SW:
	case NR_REG_NIC_SW:
		if (!(na->na_flags & NAF_HOST_RINGS)) {
			D("host rings not supported");
			return EINVAL;
		}
		priv->np_txqfirst = (reg == NR_REG_SW ?
			na->num_tx_rings : 0);
		priv->np_txqlast = na->num_tx_rings + 1;
		priv->np_rxqfirst = (reg == NR_REG_SW ?
			na->num_rx_rings : 0);
		priv->np_rxqlast = na->num_rx_rings + 1;
		ND("%s %d %d", reg == NR_REG_SW ? "SW" : "NIC+SW",
			priv->np_rxqfirst, priv->np_rxqlast);
		break;
	case NR_REG_ONE_NIC:
		if (i >= na->num_tx_rings && i >= na->num_rx_rings) {
			D("invalid ring id %d", i);
			return EINVAL;
		}
		/* if not enough rings, use the first one */
		j = i;
		if (j >= na->num_tx_rings)
			j = 0;
		priv->np_txqfirst = j;
		priv->np_txqlast = j + 1;
		j = i;
		if (j >= na->num_rx_rings)
			j = 0;
		priv->np_rxqfirst = j;
		priv->np_rxqlast = j + 1;
		break;
	default:
		D("invalid regif type %d", reg);
		return EINVAL;
	}
	priv->np_flags = (flags & ~NR_REG_MASK) | reg;

	if (netmap_verbose) {
		D("%s: tx [%d,%d) rx [%d,%d) id %d",
			na->name,
			priv->np_txqfirst,
			priv->np_txqlast,
			priv->np_rxqfirst,
			priv->np_rxqlast,
			i);
	}
	return 0;
}

/*
 * Set the ring ID. For devices with a single queue, a request
 * for all rings is the same as a single ring.
 */
static int
netmap_set_ringid(struct netmap_priv_d *priv, uint16_t ringid, uint32_t flags)
{
	struct netmap_adapter *na = priv->np_na;
	int error;

	error = netmap_interp_ringid(priv, ringid, flags);
	if (error) {
		return error;
	}

	priv->np_txpoll = (ringid & NETMAP_NO_TX_POLL) ? 0 : 1;

	/* optimization: count the users registered for more than
	 * one ring, which are the ones sleeping on the global queue.
	 * The default netmap_notify() callback will then
	 * avoid signaling the global queue if nobody is using it
	 */
	if (nm_tx_si_user(priv))
		na->tx_si_users++;
	if (nm_rx_si_user(priv))
		na->rx_si_users++;
	return 0;
}

/*
 * possibly move the interface to netmap-mode.
 * On success it returns a pointer to netmap_if, otherwise NULL.
 * This must be called with NMG_LOCK held.
 *
 * The following na callbacks are called in the process:
 *
 * na->nm_config()			[by netmap_update_config]
 * 		(get current number and size of rings)
 *
 * 		We have a generic one for linux (netmap_linux_config).
 * 		The bwrap has to override this, since it has to forward
 * 		the request to the wrapped adapter (netmap_bwrap_config).
 *
 * 		XXX netmap_if_new calls this again (2014-03-15)
 *
 * na->nm_krings_create()		[by netmap_if_new]
 * 		(create and init the krings array)
 *
 * 		One of the following:
 *
 * 		* netmap_hw_krings_create, 			(hw ports)
 * 			creates the standard layout for the krings
 * 			and adds the mbq (used for the host rings).
 *
 * 		* netmap_vp_krings_create			(VALE ports)
 * 			add leases and scratchpads
 *
 * 		* netmap_pipe_krings_create			(pipes)
 * 			create the krings and rings of both ends and
 * 			cross-link them
 *
 * 		* netmap_monitor_krings_create			(monitors)
 * 			avoid allocating the mbq
 *
 * 		* netmap_bwrap_krings_create			(bwraps)
 * 			create both the bwrap krings array,
 * 			the krings array of the wrapped adapter, and
 * 			(if needed) the fake array for the host adapter
 *
 * na->nm_register(, 1)
 * 		(put the adapter in netmap mode)
 *
 * 		This may be one of the following:
 * 		(XXX these should be either all *_register or all *_reg 2014-03-15)
 *
 * 		* netmap_hw_register				(hw ports)
 * 			checks that the ifp is still there, then calls
 * 			the hardware specific callback;
 *
 * 		* netmap_vp_reg					(VALE ports)
 * 			If the port is connected to a bridge,
 * 			set the NAF_NETMAP_ON flag under the
 * 			bridge write lock.
 *
 * 		* netmap_pipe_reg				(pipes)
 * 			inform the other pipe end that it is no
 * 			longer responsible for the lifetime of this
 * 			pipe end
 *
 * 		* netmap_monitor_reg				(monitors)
 * 			intercept the sync callbacks of the monitored
 * 			rings
 *
 * 		* netmap_bwrap_register				(bwraps)
 * 			cross-link the bwrap and hwna rings,
 * 			forward the request to the hwna, override
 * 			the hwna notify callback (to get the frames
 * 			coming from outside go through the bridge).
 *
 * XXX maybe netmap_if_new() should be merged with this (2014-03-15).
 */
struct netmap_if *
netmap_do_regif(struct netmap_priv_d *priv, struct netmap_adapter *na,
	uint16_t ringid, uint32_t flags, int *err)
{
	struct netmap_if *nifp = NULL;
	int error, need_mem = 0;

	NMG_LOCK_ASSERT();
	/* ring configuration may have changed, fetch from the card */
	netmap_update_config(na);
	priv->np_na = na;	/* store the reference */
	error = netmap_set_ringid(priv, ringid, flags);
	if (error)
		goto out;
	/* ensure allocators are ready */
	need_mem = !netmap_have_memory_locked(priv);
	if (need_mem) {
		error = netmap_get_memory_locked(priv);
		ND("get_memory returned %d", error);
		if (error)
			goto out;
	}
	/* Allocate a netmap_if and, if necessary, all the netmap_ring's */
	nifp = netmap_if_new(na);
	if (nifp == NULL) { /* allocation failed */
		error = ENOMEM;
		goto out;
	}
	na->active_fds++;
	if (!nm_netmap_on(na)) {
		/* Netmap not active, set the card in netmap mode
		 * and make it use the shared buffers.
		 */
		/* cache the allocator info in the na */
		na->na_lut = netmap_mem_get_lut(na->nm_mem);
		ND("%p->na_lut == %p", na, na->na_lut);
		na->na_lut_objtotal = netmap_mem_get_buftotal(na->nm_mem);
		na->na_lut_objsize = netmap_mem_get_bufsize(na->nm_mem);
		error = na->nm_register(na, 1); /* mode on */
		if (error) {
			netmap_do_unregif(priv, nifp);
			nifp = NULL;
		}
	}
out:
	*err = error;
	if (error) {
		priv->np_na = NULL;
		/* we should drop the allocator, but only
		 * if we were the ones who grabbed it
		 */
		if (need_mem)
			netmap_drop_memory_locked(priv);
	}
	if (nifp != NULL) {
		/*
		 * advertise that the interface is ready by setting np_nifp.
		 * The barrier is needed because readers (poll and *SYNC)
		 * check for priv->np_nifp != NULL without locking
		 */
		wmb(); /* make sure previous writes are visible to all CPUs */
		priv->np_nifp = nifp;
	}
	return nifp;
}
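
/*
 * Note: the readers (netmap_ioctl() and netmap_poll()) pair the wmb()
 * above with a check of the following form (a sketch of what they do):
 *
 *	if (priv->np_nifp == NULL)	// not registered yet
 *		return ENXIO;
 *	mb();				// reads below see initialized nifp
 *
 * so they never observe a partially published netmap_if.
 */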

/*
 * ioctl(2) support for the "netmap" device.
 *
 * Following a list of accepted commands:
 * - NIOCGINFO
 * - SIOCGIFADDR	just for convenience
 * - NIOCREGIF
 * - NIOCTXSYNC
 * - NIOCRXSYNC
 *
 * Return 0 on success, errno otherwise.
 */
int
netmap_ioctl(struct cdev *dev, u_long cmd, caddr_t data,
	int fflag, struct thread *td)
{
	struct netmap_priv_d *priv = NULL;
	struct nmreq *nmr = (struct nmreq *) data;
	struct netmap_adapter *na = NULL;
	int error;
	u_int i, qfirst, qlast;
	struct netmap_if *nifp;
	struct netmap_kring *krings;

	(void)dev;	/* UNUSED */
	(void)fflag;	/* UNUSED */

	if (cmd == NIOCGINFO || cmd == NIOCREGIF) {
		/* truncate name */
		nmr->nr_name[sizeof(nmr->nr_name) - 1] = '\0';
		if (nmr->nr_version != NETMAP_API) {
			D("API mismatch for %s got %d need %d",
				nmr->nr_name,
				nmr->nr_version, NETMAP_API);
			nmr->nr_version = NETMAP_API;
		}
		if (nmr->nr_version < NETMAP_MIN_API ||
		    nmr->nr_version > NETMAP_MAX_API) {
			return EINVAL;
		}
	}
	CURVNET_SET(TD_TO_VNET(td));

	error = devfs_get_cdevpriv((void **)&priv);
	if (error) {
		CURVNET_RESTORE();
		/* XXX ENOENT should be impossible, since the priv
		 * is now created in the open */
		return (error == ENOENT ? ENXIO : error);
	}

	switch (cmd) {
	case NIOCGINFO:		/* return capabilities etc */
		if (nmr->nr_cmd == NETMAP_BDG_LIST) {
			error = netmap_bdg_ctl(nmr, NULL);
			break;
		}

		NMG_LOCK();
		do {
			/* memsize is always valid */
			struct netmap_mem_d *nmd = &nm_mem;
			u_int memflags;

			if (nmr->nr_name[0] != '\0') {
				/* get a refcount */
				error = netmap_get_na(nmr, &na, 1 /* create */);
				if (error)
					break;
				nmd = na->nm_mem; /* get memory allocator */
			}

			error = netmap_mem_get_info(nmd, &nmr->nr_memsize, &memflags,
				&nmr->nr_arg2);
			if (error)
				break;
			if (na == NULL) /* only memory info */
				break;
			nmr->nr_offset = 0;
			nmr->nr_rx_slots = nmr->nr_tx_slots = 0;
			netmap_update_config(na);
			nmr->nr_rx_rings = na->num_rx_rings;
			nmr->nr_tx_rings = na->num_tx_rings;
			nmr->nr_rx_slots = na->num_rx_desc;
			nmr->nr_tx_slots = na->num_tx_desc;
			netmap_adapter_put(na);
		} while (0);
		NMG_UNLOCK();
		break;
2092 /* possibly attach/detach NIC and VALE switch */
2094 if (i == NETMAP_BDG_ATTACH || i == NETMAP_BDG_DETACH
2095 || i == NETMAP_BDG_VNET_HDR
2096 || i == NETMAP_BDG_NEWIF
2097 || i == NETMAP_BDG_DELIF) {
2098 error = netmap_bdg_ctl(nmr, NULL);
2100 } else if (i != 0) {
2101 D("nr_cmd must be 0 not %d", i);
2106 /* protect access to priv from concurrent NIOCREGIF */
2111 if (priv->np_na != NULL) { /* thread already registered */
2115 /* find the interface and acquire a reference */
2116 error = netmap_get_na(nmr, &na, 1 /* create */); /* keep reference */
2119 if (NETMAP_OWNED_BY_KERN(na)) {
2120 netmap_adapter_put(na);
2124 nifp = netmap_do_regif(priv, na, nmr->nr_ringid, nmr->nr_flags, &error);
2125 if (!nifp) { /* reg. failed, release priv and ref */
2126 netmap_adapter_put(na);
2127 priv->np_nifp = NULL;
2130 priv->np_td = td; // XXX kqueue, debugging only
2132 /* return the offset of the netmap_if object */
2133 nmr->nr_rx_rings = na->num_rx_rings;
2134 nmr->nr_tx_rings = na->num_tx_rings;
2135 nmr->nr_rx_slots = na->num_rx_desc;
2136 nmr->nr_tx_slots = na->num_tx_desc;
2137 error = netmap_mem_get_info(na->nm_mem, &nmr->nr_memsize, &memflags,
2140 netmap_adapter_put(na);
2143 if (memflags & NETMAP_MEM_PRIVATE) {
2144 *(uint32_t *)(uintptr_t)&nifp->ni_flags |= NI_PRIV_MEM;
2146 priv->np_txsi = (priv->np_txqlast - priv->np_txqfirst > 1) ?
2147 &na->tx_si : &na->tx_rings[priv->np_txqfirst].si;
2148 priv->np_rxsi = (priv->np_rxqlast - priv->np_rxqfirst > 1) ?
2149 &na->rx_si : &na->rx_rings[priv->np_rxqfirst].si;
2152 D("requested %d extra buffers", nmr->nr_arg3);
2153 nmr->nr_arg3 = netmap_extra_alloc(na,
2154 &nifp->ni_bufs_head, nmr->nr_arg3);
2155 D("got %d extra buffers", nmr->nr_arg3);
2157 nmr->nr_offset = netmap_mem_if_offset(na->nm_mem, nifp);
2164 nifp = priv->np_nifp;
2170 mb(); /* make sure following reads are not from cache */
2172 na = priv->np_na; /* we have a reference */
2175 D("Internal error: nifp != NULL && na == NULL");
2180 if (!nm_netmap_on(na)) {
2185 if (cmd == NIOCTXSYNC) {
2186 krings = na->tx_rings;
2187 qfirst = priv->np_txqfirst;
2188 qlast = priv->np_txqlast;
2190 krings = na->rx_rings;
2191 qfirst = priv->np_rxqfirst;
2192 qlast = priv->np_rxqlast;
2195 for (i = qfirst; i < qlast; i++) {
2196 struct netmap_kring *kring = krings + i;
2197 if (nm_kr_tryget(kring)) {
2201 if (cmd == NIOCTXSYNC) {
2202 if (netmap_verbose & NM_VERB_TXSYNC)
2203 D("pre txsync ring %d cur %d hwcur %d",
2204 i, kring->ring->cur,
2206 if (nm_txsync_prologue(kring) >= kring->nkr_num_slots) {
2207 netmap_ring_reinit(kring);
2209 kring->nm_sync(kring, NAF_FORCE_RECLAIM);
2211 if (netmap_verbose & NM_VERB_TXSYNC)
2212 D("post txsync ring %d cur %d hwcur %d",
2213 i, kring->ring->cur,
2216 kring->nm_sync(kring, NAF_FORCE_READ);
2217 microtime(&na->rx_rings[i].ring->ts);
2225 error = netmap_bdg_config(nmr);
2230 ND("FIONBIO/FIOASYNC are no-ops");
2237 D("ignore BIOCIMMEDIATE/BIOCGHDRCMPLT/BIOCSHDRCMPLT/BIOCSSEESENT");
2240 default: /* allow device-specific ioctls */
2242 struct ifnet *ifp = ifunit_ref(nmr->nr_name);
2248 bzero(&so, sizeof(so));
2249 so.so_vnet = ifp->if_vnet;
2250 // note: so.so_proto is left NULL (see bzero above)
2251 error = ifioctl(&so, cmd, data, td);
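/* For reference, a hedged userspace sketch of the sequence these
 * handlers serve. The function name example_open is hypothetical and
 * error handling is mostly elided; the standard netmap headers are
 * assumed. Illustration only, not compiled:
 */
#if 0
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

static struct netmap_if *
example_open(const char *name, int *pfd)
{
	struct nmreq req;
	void *mem;
	int fd = open("/dev/netmap", O_RDWR);

	memset(&req, 0, sizeof(req));
	req.nr_version = NETMAP_API;	/* checked against NETMAP_MIN/MAX_API */
	strncpy(req.nr_name, name, sizeof(req.nr_name) - 1);
	if (ioctl(fd, NIOCREGIF, &req))	/* bind fd to all rings of "name" */
		return NULL;
	mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);		/* error checks elided */
	*pfd = fd;
	return NETMAP_IF(mem, req.nr_offset);	/* nr_offset set by NIOCREGIF */
}
#endif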
2270 * select(2) and poll(2) handlers for the "netmap" device.
2272 * Can be called for one or more queues.
2273 * Return the event mask corresponding to ready events.
2274 * If there are no ready events, do a selrecord on either individual
2275 * selinfo or on the global one.
2276 * Device-dependent parts (locking and sync of tx/rx rings)
2277 * are done through callbacks.
2279 * On linux, arguments are really pwait, the poll table, and 'td' is struct file *
2280 * The first one is remapped to pwait as selrecord() uses the name as a hidden argument.
2284 netmap_poll(struct cdev *dev, int events, struct thread *td)
2286 struct netmap_priv_d *priv = NULL;
2287 struct netmap_adapter *na;
2288 struct netmap_kring *kring;
2289 u_int i, check_all_tx, check_all_rx, want_tx, want_rx, revents = 0;
2290 struct mbq q; /* packets from hw queues to host stack */
2291 void *pwait = dev; /* linux compatibility */
2295 * In order to avoid nested locks, we need to "double check"
2296 * txsync and rxsync if we decide to do a selrecord().
2297 * retry_tx (and retry_rx, later) prevent looping forever.
2299 int retry_tx = 1, retry_rx = 1;
2305 * XXX kevent has curthread->td_fpop == NULL,
2306 * so devfs_get_cdevpriv() fails. We circumvent this by passing
2307 * priv as the first argument, which is also useful to avoid
2308 * the selrecord() calls, which are not necessary in that case.
2310 if (devfs_get_cdevpriv((void **)&priv) != 0) {
2313 D("called from kevent");
2314 priv = (struct netmap_priv_d *)dev;
2319 if (priv->np_nifp == NULL) {
2320 D("No if registered");
2323 rmb(); /* make sure following reads are not from cache */
2327 if (!nm_netmap_on(na))
2330 if (netmap_verbose & 0x8000)
2331 D("device %s events 0x%x", na->name, events);
2332 want_tx = events & (POLLOUT | POLLWRNORM);
2333 want_rx = events & (POLLIN | POLLRDNORM);
2337 * check_all_{tx|rx} are set if the card has more than one queue AND
2338 * the file descriptor is bound to all of them. If so, we sleep on
2339 * the "global" selinfo, otherwise we sleep on individual selinfo
2340 * (FreeBSD only allows two selinfo's per file descriptor).
2341 * The interrupt routine in the driver wakes one or the other
2342 * (or both) depending on which clients are active.
2344 * rxsync() is only called if we run out of buffers on a POLLIN.
2345 * txsync() is called if we run out of buffers on POLLOUT, or
2346 * there are pending packets to send. The latter can be disabled
2347 * by passing NETMAP_NO_TX_POLL in the NIOCREGIF call.
2349 check_all_tx = nm_tx_si_user(priv);
2350 check_all_rx = nm_rx_si_user(priv);
2353 * We start with a lock free round which is cheap if we have
2354 * slots available. If this fails, then lock and call the sync routines.
2357 for (i = priv->np_rxqfirst; want_rx && i < priv->np_rxqlast; i++) {
2358 kring = &na->rx_rings[i];
2359 /* XXX compare ring->cur and kring->tail */
2360 if (!nm_ring_empty(kring->ring)) {
2362 want_rx = 0; /* also breaks the loop */
2365 for (i = priv->np_txqfirst; want_tx && i < priv->np_txqlast; i++) {
2366 kring = &na->tx_rings[i];
2367 /* XXX compare ring->cur and kring->tail */
2368 if (!nm_ring_empty(kring->ring)) {
2370 want_tx = 0; /* also breaks the loop */
2375 * If we want to push packets out (priv->np_txpoll) or
2376 * want_tx is still set, we must issue txsync calls
2377 * (on all rings, to avoid stalling the tx rings).
2378 * XXX should also check cur != hwcur on the tx rings.
2379 * Fortunately, normal tx mode has np_txpoll set.
2381 if (priv->np_txpoll || want_tx) {
2383 * The first round checks if anyone is ready, if not
2384 * do a selrecord and another round to handle races.
2385 * want_tx goes to 0 if any space is found, and is
2386 * used to skip rings with no pending transmissions.
2389 for (i = priv->np_txqfirst; i < priv->np_txqlast; i++) {
2392 kring = &na->tx_rings[i];
2393 if (!want_tx && kring->ring->cur == kring->nr_hwcur)
2395 /* only one thread does txsync */
2396 if (nm_kr_tryget(kring)) {
2397 /* either busy or stopped
2398 * XXX if the ring is stopped, sleeping would
2399 * be better. In current code, however, we only
2400 * stop the rings for brief intervals (2014-03-14)
2403 RD(2, "%p lost race on txring %d, ok",
2407 if (nm_txsync_prologue(kring) >= kring->nkr_num_slots) {
2408 netmap_ring_reinit(kring);
2411 if (kring->nm_sync(kring, 0))
2416 * If we found new slots, notify potential
2417 * listeners on the same ring.
2418 * Since we just did a txsync, look at the copies
2419 * of cur,tail in the kring.
2421 found = kring->rcur != kring->rtail;
2423 if (found) { /* notify other listeners */
2426 na->nm_notify(na, i, NR_TX, 0);
2429 if (want_tx && retry_tx && !is_kevent) {
2430 OS_selrecord(td, check_all_tx ?
2431 &na->tx_si : &na->tx_rings[priv->np_txqfirst].si);
2438 * If want_rx is still set scan receive rings.
2439 * Do it on all rings because otherwise we starve.
2442 int send_down = 0; /* transparent mode */
2443 /* two rounds here for race avoidance */
2445 for (i = priv->np_rxqfirst; i < priv->np_rxqlast; i++) {
2448 kring = &na->rx_rings[i];
2450 if (nm_kr_tryget(kring)) {
2452 RD(2, "%p lost race on rxring %d, ok",
2458 * transparent mode support: collect packets
2459 * from the rxring(s).
2460 * XXX NR_FORWARD should only be read on
2461 * physical or NIC ports
2463 if (netmap_fwd || (kring->ring->flags & NR_FORWARD)) {
2464 ND(10, "forwarding some buffers up %d to %d",
2465 kring->nr_hwcur, kring->ring->cur);
2466 netmap_grab_packets(kring, &q, netmap_fwd);
2469 if (kring->nm_sync(kring, 0))
2471 if (netmap_no_timestamp == 0 ||
2472 kring->ring->flags & NR_TIMESTAMP) {
2473 microtime(&kring->ring->ts);
2475 /* after an rxsync we can use kring->rcur, rtail */
2476 found = kring->rcur != kring->rtail;
2481 na->nm_notify(na, i, NR_RX, 0);
2485 /* transparent mode XXX only during first pass ? */
2486 if (na->na_flags & NAF_HOST_RINGS) {
2487 kring = &na->rx_rings[na->num_rx_rings];
2489 && (netmap_fwd || (kring->ring->flags & NR_FORWARD))) {
2490 /* XXX fix to use kring fields */
2491 if (nm_ring_empty(kring->ring))
2492 send_down = netmap_rxsync_from_host(na, td, dev);
2493 if (!nm_ring_empty(kring->ring))
2498 if (retry_rx && !is_kevent)
2499 OS_selrecord(td, check_all_rx ?
2500 &na->rx_si : &na->rx_rings[priv->np_rxqfirst].si);
2501 if (send_down > 0 || retry_rx) {
2504 goto flush_tx; /* and retry_rx */
2511 * Transparent mode: marked bufs on rx rings between
2512 * kring->nr_hwcur and ring->head
2513 * are passed to the other endpoint.
2515 * In this mode we also scan the sw rxring, which in
2516 * turn passes packets up.
2518 * XXX Transparent mode at the moment requires binding all
2519 * rings to a single file descriptor.
2522 if (q.head && na->ifp != NULL)
2523 netmap_send_up(na->ifp, &q);
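/* A hedged sketch of the userspace counterpart of this handler: poll()
 * for POLLIN, then drain whatever became ready. fd and nifp come from a
 * registration as sketched after netmap_ioctl() above; ring index 0 is
 * illustrative. Not compiled:
 */
#if 0
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	for (;;) {
		poll(&pfd, 1, 1000 /* ms */);
		if (pfd.revents & POLLIN) {
			struct netmap_ring *ring = NETMAP_RXRING(nifp, 0);

			while (!nm_ring_empty(ring)) {
				struct netmap_slot *slot = &ring->slot[ring->cur];

				/* ... consume NETMAP_BUF(ring, slot->buf_idx) ... */
				ring->head = ring->cur = nm_ring_next(ring, ring->cur);
			}
		}
	}
#endif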
2529 /*-------------------- driver support routines -------------------*/
2531 static int netmap_hw_krings_create(struct netmap_adapter *);
2533 /* default notify callback */
2535 netmap_notify(struct netmap_adapter *na, u_int n_ring,
2536 enum txrx tx, int flags)
2538 struct netmap_kring *kring;
2541 kring = na->tx_rings + n_ring;
2542 OS_selwakeup(&kring->si, PI_NET);
2543 /* optimization: avoid a wake up on the global
2544 * queue if nobody has registered for more than one ring
2547 if (na->tx_si_users > 0)
2548 OS_selwakeup(&na->tx_si, PI_NET);
2550 kring = na->rx_rings + n_ring;
2551 OS_selwakeup(&kring->si, PI_NET);
2552 /* optimization: same as above */
2553 if (na->rx_si_users > 0)
2554 OS_selwakeup(&na->rx_si, PI_NET);
2560 /* called by all routines that create netmap_adapters.
2561 * Attach na to the ifp (if any) and provide defaults
2562 * for optional callbacks. Defaults assume that we
2563 * are creating a hardware netmap_adapter.
2566 netmap_attach_common(struct netmap_adapter *na)
2568 struct ifnet *ifp = na->ifp;
2570 if (na->num_tx_rings == 0 || na->num_rx_rings == 0) {
2571 D("%s: invalid rings tx %d rx %d",
2572 na->name, na->num_tx_rings, na->num_rx_rings);
2575 /* ifp is NULL for virtual adapters (bwrap, non-persistent VALE ports,
2576 * pipes, monitors). For bwrap we actually have a non-null ifp for
2577 * use by the external modules, but that is set after this
2578 * function has been called.
2579 * XXX this is ugly, maybe split this function in two (2014-03-14)
2584 /* the following is only needed for na that use the host port.
2585 * XXX do we have something similar for linux ?
2588 na->if_input = ifp->if_input; /* for netmap_send_up */
2589 #endif /* __FreeBSD__ */
2591 NETMAP_SET_CAPABLE(ifp);
2593 if (na->nm_krings_create == NULL) {
2594 /* we assume that we have been called by a driver,
2595 * since other port types all provide their own nm_krings_create callbacks
2598 na->nm_krings_create = netmap_hw_krings_create;
2599 na->nm_krings_delete = netmap_hw_krings_delete;
2601 if (na->nm_notify == NULL)
2602 na->nm_notify = netmap_notify;
2605 if (na->nm_mem == NULL)
2606 /* use the global allocator */
2607 na->nm_mem = &nm_mem;
2608 if (na->nm_bdg_attach == NULL)
2609 /* no special nm_bdg_attach callback. On VALE
2610 * attach, we need to interpose a bwrap
2612 na->nm_bdg_attach = netmap_bwrap_attach;
2617 /* standard cleanup, called by all destructors */
2619 netmap_detach_common(struct netmap_adapter *na)
2621 if (na->ifp != NULL)
2622 WNA(na->ifp) = NULL; /* XXX do we need this? */
2624 if (na->tx_rings) { /* XXX should not happen */
2625 D("freeing leftover tx_rings");
2626 na->nm_krings_delete(na);
2628 netmap_pipe_dealloc(na);
2629 if (na->na_flags & NAF_MEM_OWNER)
2630 netmap_mem_private_delete(na->nm_mem);
2631 bzero(na, sizeof(*na));
2635 /* Wrapper for the register callback provided by hardware drivers.
2636 * na->ifp == NULL means that the driver module has been
2637 * unloaded, so we cannot call into it.
2638 * Note that module unloading, in our patched linux drivers,
2639 * happens under NMG_LOCK and after having stopped all the
2640 * nic rings (see netmap_detach). This provides sufficient
2641 * protection for the other driver-provided callbacks
2642 * (i.e., nm_config and nm_*xsync), which therefore don't need extra protection.
2646 netmap_hw_register(struct netmap_adapter *na, int onoff)
2648 struct netmap_hw_adapter *hwna =
2649 (struct netmap_hw_adapter*)na;
2651 if (na->ifp == NULL)
2652 return onoff ? ENXIO : 0;
2654 return hwna->nm_hw_register(na, onoff);
2659 * Initialize a ``netmap_adapter`` object created by a driver on attach.
2660 * We allocate a block of memory with room for a struct netmap_adapter
2661 * plus two sets of N+2 struct netmap_kring (where N is the number
2662 * of hardware rings):
2663 * krings 0..N-1 are for the hardware queues.
2664 * kring N is for the host stack queue
2665 * kring N+1 is only used for the selinfo for all queues. // XXX still true ?
2666 * Return 0 on success, ENOMEM otherwise.
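 *
 * For example, with N = 4 hardware queues the arrays are laid out as:
 *   tx_rings[0..3]  hardware TX queues
 *   tx_rings[4]     host stack TX queue
 * (and symmetrically for rx_rings).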
2669 netmap_attach(struct netmap_adapter *arg)
2671 struct netmap_hw_adapter *hwna = NULL;
2672 // XXX when is arg == NULL ?
2673 struct ifnet *ifp = arg ? arg->ifp : NULL;
2675 if (arg == NULL || ifp == NULL)
2677 hwna = malloc(sizeof(*hwna), M_DEVBUF, M_NOWAIT | M_ZERO);
2681 hwna->up.na_flags |= NAF_HOST_RINGS;
2682 strncpy(hwna->up.name, ifp->if_xname, sizeof(hwna->up.name));
2683 hwna->nm_hw_register = hwna->up.nm_register;
2684 hwna->up.nm_register = netmap_hw_register;
2685 if (netmap_attach_common(&hwna->up)) {
2686 free(hwna, M_DEVBUF);
2689 netmap_adapter_get(&hwna->up);
2692 if (ifp->netdev_ops) {
2693 /* prepare a clone of the netdev ops */
2694 #if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 28)
2695 hwna->nm_ndo.ndo_start_xmit = ifp->netdev_ops;
2697 hwna->nm_ndo = *ifp->netdev_ops;
2700 hwna->nm_ndo.ndo_start_xmit = linux_netmap_start_xmit;
2701 if (ifp->ethtool_ops) {
2702 hwna->nm_eto = *ifp->ethtool_ops;
2704 hwna->nm_eto.set_ringparam = linux_netmap_set_ringparam;
2705 #ifdef ETHTOOL_SCHANNELS
2706 hwna->nm_eto.set_channels = linux_netmap_set_channels;
2708 if (arg->nm_config == NULL) {
2709 hwna->up.nm_config = netmap_linux_config;
2714 if_printf(ifp, "netmap queues/slots: TX %d/%d, RX %d/%d\n",
2715 hwna->up.num_tx_rings, hwna->up.num_tx_desc,
2716 hwna->up.num_rx_rings, hwna->up.num_rx_desc);
2718 D("success for %s tx %d/%d rx %d/%d queues/slots",
2720 hwna->up.num_tx_rings, hwna->up.num_tx_desc,
2721 hwna->up.num_rx_rings, hwna->up.num_rx_desc
2727 D("fail, arg %p ifp %p na %p", arg, ifp, hwna);
2730 return (hwna ? EINVAL : ENOMEM);
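/* A hedged sketch of how a NIC driver is expected to fill in the
 * request and call netmap_attach() at attach time. The driver name
 * "foo" and all foo_* identifiers are hypothetical. Not compiled:
 */
#if 0
static void
foo_netmap_attach(struct foo_softc *sc)
{
	struct netmap_adapter na;

	bzero(&na, sizeof(na));
	na.ifp = sc->ifp;
	na.num_tx_desc = sc->num_tx_desc;
	na.num_rx_desc = sc->num_rx_desc;
	na.num_tx_rings = na.num_rx_rings = sc->num_queues;
	na.nm_register = foo_netmap_reg;	/* driver-provided callbacks */
	na.nm_txsync = foo_netmap_txsync;
	na.nm_rxsync = foo_netmap_rxsync;
	netmap_attach(&na);	/* copies na into a new netmap_hw_adapter */
}
#endif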
2735 NM_DBG(netmap_adapter_get)(struct netmap_adapter *na)
2741 refcount_acquire(&na->na_refcount);
2745 /* returns 1 iff the netmap_adapter is destroyed */
2747 NM_DBG(netmap_adapter_put)(struct netmap_adapter *na)
2752 if (!refcount_release(&na->na_refcount))
2758 netmap_detach_common(na);
2763 /* nm_krings_create callback for all hardware native adapters */
2765 netmap_hw_krings_create(struct netmap_adapter *na)
2767 int ret = netmap_krings_create(na, 0);
2769 /* initialize the mbq for the sw rx ring */
2770 mbq_safe_init(&na->rx_rings[na->num_rx_rings].rx_queue);
2771 ND("initialized sw rx queue %d", na->num_rx_rings);
2779 * Called on module unload by the netmap-enabled drivers
2782 netmap_detach(struct ifnet *ifp)
2784 struct netmap_adapter *na = NA(ifp);
2790 netmap_disable_all_rings(ifp);
2791 if (!netmap_adapter_put(na)) {
2792 /* someone is still using the adapter,
2793 * tell them that the interface is gone
2796 // XXX also clear NAF_NATIVE_ON ?
2797 na->na_flags &= ~NAF_NETMAP_ON;
2798 /* give them a chance to notice */
2799 netmap_enable_all_rings(ifp);
2806 * Intercept packets from the network stack and pass them
2807 * to netmap as incoming packets on the 'software' ring.
2809 * We only store packets in a bounded mbq and then copy them
2810 * in the relevant rxsync routine.
2812 * We rely on the OS to make sure that the ifp and na do not go
2813 * away (typically the caller checks for IFF_DRV_RUNNING or the like).
2814 * In nm_register() or whenever there is a reinitialization,
2815 * we make sure the mode change becomes visible here.
2818 netmap_transmit(struct ifnet *ifp, struct mbuf *m)
2820 struct netmap_adapter *na = NA(ifp);
2821 struct netmap_kring *kring;
2822 u_int len = MBUF_LEN(m);
2823 u_int error = ENOBUFS;
2827 // XXX [Linux] we do not need this lock
2828 // if we follow the down/configure/up protocol -gl
2829 // mtx_lock(&na->core_lock);
2831 if (!nm_netmap_on(na)) {
2832 D("%s not in netmap mode anymore", na->name);
2837 kring = &na->rx_rings[na->num_rx_rings];
2838 q = &kring->rx_queue;
2840 // XXX reconsider long packets if we handle fragments
2841 if (len > NETMAP_BUF_SIZE(na)) { /* too long for us */
2842 D("%s from_host, drop packet size %d > %d", na->name,
2843 len, NETMAP_BUF_SIZE(na));
2847 /* protect against rxsync_from_host(), netmap_sw_to_nic()
2848 * and maybe other instances of netmap_transmit (the latter
2849 * not possible on Linux).
2850 * Also avoid overflowing the queue.
2854 space = kring->nr_hwtail - kring->nr_hwcur;
2856 space += kring->nkr_num_slots;
2857 if (space + mbq_len(q) >= kring->nkr_num_slots - 1) { // XXX
2858 RD(10, "%s full hwcur %d hwtail %d qlen %d len %d m %p",
2859 na->name, kring->nr_hwcur, kring->nr_hwtail, mbq_len(q),
2863 ND(10, "%s %d bufs in queue len %d m %p",
2864 na->name, mbq_len(q), len, m);
2865 /* notify outside the lock */
2874 /* unconditionally wake up listeners */
2875 na->nm_notify(na, na->num_rx_rings, NR_RX, 0);
2876 /* this is normally netmap_notify(), but for nics
2877 * connected to a bridge it is netmap_bwrap_intr_notify(),
2878 * that possibly forwards the frames through the switch
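/* Worked example of the queue-space check in netmap_transmit() above,
 * with illustrative numbers: nkr_num_slots = 256, nr_hwcur = 200 and
 * nr_hwtail = 10 give space = 10 - 200 + 256 = 66 slots already held
 * by the kernel; with mbq_len(q) = 189 the test 66 + 189 >= 255 is
 * true and the packet is dropped.
 */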
2886 * netmap_reset() is called by the driver routines when reinitializing
2887 * a ring. The driver is in charge of locking to protect the kring.
2888 * If native netmap mode is not set, just return NULL.
2890 struct netmap_slot *
2891 netmap_reset(struct netmap_adapter *na, enum txrx tx, u_int n,
2894 struct netmap_kring *kring;
2897 if (!nm_native_on(na)) {
2898 ND("interface not in native netmap mode");
2899 return NULL; /* nothing to reinitialize */
2902 /* XXX note: in the new scheme, we are not guaranteed to be
2903 * under lock (e.g. when called on a device reset).
2904 * In this case we should set a flag and not trust the
2905 * values too much. In practice: TODO
2906 * - set a RESET flag somewhere in the kring
2907 * - do the processing in a conservative way
2908 * - let the *sync() fixup at the end.
2911 if (n >= na->num_tx_rings)
2913 kring = na->tx_rings + n;
2914 // XXX check whether we should use hwcur or rcur
2915 new_hwofs = kring->nr_hwcur - new_cur;
2917 if (n >= na->num_rx_rings)
2919 kring = na->rx_rings + n;
2920 new_hwofs = kring->nr_hwtail - new_cur;
2922 lim = kring->nkr_num_slots - 1;
2923 if (new_hwofs > lim)
2924 new_hwofs -= lim + 1;
2926 /* Always set the new offset value and realign the ring. */
2928 D("%s %s%d hwofs %d -> %d, hwtail %d -> %d",
2930 tx == NR_TX ? "TX" : "RX", n,
2931 kring->nkr_hwofs, new_hwofs,
2933 tx == NR_TX ? lim : kring->nr_hwtail);
2934 kring->nkr_hwofs = new_hwofs;
2936 kring->nr_hwtail = kring->nr_hwcur + lim;
2937 if (kring->nr_hwtail > lim)
2938 kring->nr_hwtail -= lim + 1;
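/* Worked example of the fixup above (TX case, illustrative numbers):
 * nkr_num_slots = 256, nr_hwcur = 200 and new_cur = 10 give
 * new_hwofs = 190, so the index-translation helpers map NIC
 * descriptor 10 (where the ring restarts) to netmap slot
 * (10 + 190) % 256 = 200 = nr_hwcur, preserving the user-visible
 * state across the reset.
 */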
2942 /* XXX dead code (compiled out): check that the mappings are correct */
2943 /* need ring_nr, adapter->pdev, direction */
2944 buffer_info->dma = dma_map_single(&pdev->dev, addr, adapter->rx_buffer_len, DMA_FROM_DEVICE);
2945 if (dma_mapping_error(&adapter->pdev->dev, buffer_info->dma)) {
2946 D("error mapping rx netmap buffer %d", i);
2947 // XXX fix error handling
2952 * Wakeup on the individual and global selwait
2953 * We do the wakeup here, but the ring is not yet reconfigured.
2954 * However, we are under lock so there are no races.
2956 na->nm_notify(na, n, tx, 0);
2957 return kring->ring->slot;
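/* A hedged sketch of the expected use of netmap_reset() from a driver's
 * TX ring (re)initialization routine; "foo" and txr->me (the ring
 * index) are hypothetical. Not compiled:
 */
#if 0
	struct netmap_slot *slot = netmap_reset(na, NR_TX, txr->me, 0);
	uint64_t paddr;
	int i, si;

	if (slot != NULL) {	/* native netmap mode: use netmap buffers */
		for (i = 0; i < na->num_tx_desc; i++) {
			si = netmap_idx_n2k(&na->tx_rings[txr->me], i);
			PNMB(na, slot + si, &paddr);
			/* program NIC descriptor i with paddr ... */
		}
	}
#endif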
2962 * Dispatch rx/tx interrupts to the netmap rings.
2964 * "work_done" is non-null on the RX path, NULL for the TX path.
2965 * We rely on the OS to make sure that there is only one active
2966 * instance per queue, and that there is appropriate locking.
2968 * The 'notify' routine depends on what the ring is attached to.
2969 * - for a netmap file descriptor, do a selwakeup on the individual
2970 * waitqueue, plus one on the global one if needed
2971 * (see netmap_notify)
2972 * - for a nic connected to a switch, call the proper forwarding routine
2973 * (see netmap_bwrap_intr_notify)
2976 netmap_common_irq(struct ifnet *ifp, u_int q, u_int *work_done)
2978 struct netmap_adapter *na = NA(ifp);
2979 struct netmap_kring *kring;
2981 q &= NETMAP_RING_MASK;
2983 if (netmap_verbose) {
2984 RD(5, "received %s queue %d", work_done ? "RX" : "TX" , q);
2987 if (work_done) { /* RX path */
2988 if (q >= na->num_rx_rings)
2989 return; // not a physical queue
2990 kring = na->rx_rings + q;
2991 kring->nr_kflags |= NKR_PENDINTR; // XXX atomic ?
2992 na->nm_notify(na, q, NR_RX, 0);
2993 *work_done = 1; /* do not fire napi again */
2994 } else { /* TX path */
2995 if (q >= na->num_tx_rings)
2996 return; // not a physical queue
2997 kring = na->tx_rings + q;
2998 na->nm_notify(na, q, NR_TX, 0);
3004 * Default functions to handle rx/tx interrupts from a physical device.
3005 * "work_done" is non-null on the RX path, NULL for the TX path.
3007 * If the card is not in netmap mode, simply return 0,
3008 * so that the caller proceeds with regular processing.
3009 * Otherwise call netmap_common_irq() and return 1.
3011 * If the card is connected to a netmap file descriptor,
3012 * do a selwakeup on the individual queue, plus one on the global one
3013 * if needed (multiqueue card _and_ there are multiqueue listeners),
3016 * Finally, if called on rx from an interface connected to a switch,
3017 * call the proper forwarding routine and return 1.
3020 netmap_rx_irq(struct ifnet *ifp, u_int q, u_int *work_done)
3022 struct netmap_adapter *na = NA(ifp);
3025 * XXX emulated netmap mode sets NAF_SKIP_INTR so
3026 * we still use the regular driver even though the previous
3027 * check fails. It is unclear whether we should use
3028 * nm_native_on() here.
3030 if (!nm_netmap_on(na))
3033 if (na->na_flags & NAF_SKIP_INTR) {
3034 ND("use regular interrupt");
3038 netmap_common_irq(ifp, q, work_done);
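/* A hedged sketch of the expected call site in a driver's RX
 * interrupt/cleanup routine (hypothetical driver "foo", ring index
 * rxr->me). Not compiled:
 */
#if 0
	u_int work_done;

	if (netmap_rx_irq(ifp, rxr->me, &work_done))
		return;		/* netmap consumed the event, skip the mbuf path */
	/* ... regular mbuf processing follows ... */
#endif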
3044 * Module loader and unloader
3046 * netmap_init() creates the /dev/netmap device and initializes
3047 * all global variables. Returns 0 on success, errno on failure
3048 * (failure is not expected in practice)
3050 * netmap_fini() destroys everything.
3053 static struct cdev *netmap_dev; /* /dev/netmap character device. */
3054 extern struct cdevsw netmap_cdevsw;
3060 // XXX destroy_bridges() ?
3062 destroy_dev(netmap_dev);
3065 printf("netmap: unloaded module.\n");
3076 error = netmap_mem_init();
3080 * MAKEDEV_ETERNAL_KLD avoids an expensive check on syscalls
3081 * when the module is compiled in.
3082 * XXX could use make_dev_credv() to get error number
3084 netmap_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD,
3085 &netmap_cdevsw, 0, NULL, UID_ROOT, GID_WHEEL, 0600,
3090 netmap_init_bridges();
3094 printf("netmap: loaded module.\n");
3098 return (EINVAL); /* may be incorrect */