References:
http://blog.csdn.net/hejin_some/article/details/72473031
http://blog.csdn.net/bestboyxie/article/details/52984397
http://dpdk.org/doc/guides/prog_guide/mbuf_lib.html#mbuf-library
http://www.cnblogs.com/yhp-smarthome/p/6687175.html
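This post is not really original work: it combines a translation of the official documentation, excerpts from other blog posts, and some of my own hands-on experience.
0. Direct and Indirect Buffers: for an introduction, see http://dpdk.org/doc/guides/prog_guide/mbuf_lib.html#mbuf-library
1. The core mbuf structure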
struct rte_mbuf {
    MARKER cacheline0;
    void *buf_addr; /**< Virtual address of segment buffer. */
    phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
    uint16_t buf_len; /**< Length of segment buffer. */
    /* next 6 bytes are initialised on RX descriptor rearm */
    MARKER8 rearm_data;
    uint16_t data_off;
    /**
     * 16-bit Reference counter.
     * It should only be accessed using the following functions:
     * rte_mbuf_refcnt_update(), rte_mbuf_refcnt_read(), and
     * rte_mbuf_refcnt_set(). The functionality of these functions (atomic,
     * or non-atomic) is controlled by the CONFIG_RTE_MBUF_REFCNT_ATOMIC
     * config option.
     */
    RTE_STD_C11
    union {
        rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
        uint16_t refcnt; /**< Non-atomically accessed refcnt */
    };
    uint8_t nb_segs; /**< Number of segments. */
    uint8_t port; /**< Input port. */
    uint64_t ol_flags; /**< Offload features. */
    /* remaining bytes are set on RX when pulling packet from descriptor */
    MARKER rx_descriptor_fields1;
    /*
     * The packet type, which is the combination of outer/inner L2, L3, L4
     * and tunnel types. The packet_type is about data really present in the
     * mbuf. Example: if vlan stripping is enabled, a received vlan packet
     * would have RTE_PTYPE_L2_ETHER and not RTE_PTYPE_L2_VLAN because the
     * vlan is stripped from the data.
     */
    RTE_STD_C11
    union {
        uint32_t packet_type; /**< L2/L3/L4 and tunnel information. */
        struct {
            uint32_t l2_type:4; /**< (Outer) L2 type. */
            uint32_t l3_type:4; /**< (Outer) L3 type. */
            uint32_t l4_type:4; /**< (Outer) L4 type. */
            uint32_t tun_type:4; /**< Tunnel type. */
            uint32_t inner_l2_type:4; /**< Inner L2 type. */
            uint32_t inner_l3_type:4; /**< Inner L3 type. */
            uint32_t inner_l4_type:4; /**< Inner L4 type. */
        };
    };
    uint32_t pkt_len; /**< Total pkt len: sum of all segments. */
    uint16_t data_len; /**< Amount of data in segment buffer. */
    /** VLAN TCI (CPU order), valid if PKT_RX_VLAN_STRIPPED is set. */
    uint16_t vlan_tci;
    union {
        uint32_t rss; /**< RSS hash result if RSS enabled */
        struct {
            RTE_STD_C11
            union {
                struct {
                    uint16_t hash;
                    uint16_t id;
                };
                uint32_t lo;
                /**< Second 4 flexible bytes */
            };
            uint32_t hi;
            /**< First 4 flexible bytes or FD ID, dependent on
                 PKT_RX_FDIR_* flag in ol_flags. */
        } fdir; /**< Filter identifier if FDIR enabled */
        struct {
            uint32_t lo;
            uint32_t hi;
        } sched; /**< Hierarchical scheduler */
        uint32_t usr; /**< User defined tags. See rte_distributor_process() */
    } hash; /**< hash information */
    uint32_t seqn; /**< Sequence number. See also rte_reorder_insert() */
    /** Outer VLAN TCI (CPU order), valid if PKT_RX_QINQ_STRIPPED is set. */
    uint16_t vlan_tci_outer;
    /* second cache line - fields only used in slow path or on TX */
    MARKER cacheline1 __rte_cache_min_aligned;
    RTE_STD_C11
    union {
        void *userdata; /**< Can be used for external metadata */
        uint64_t udata64; /**< Allow 8-byte userdata on 32-bit */
    };
    struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
    struct rte_mbuf *next; /**< Next segment of scattered packet. */
    /* fields to support TX offloads */
    RTE_STD_C11
    union {
        uint64_t tx_offload; /**< combined for easy fetch */
        __extension__
        struct {
            uint64_t l2_len:7;
            /**< L2 (MAC) Header Length for non-tunneling pkt.
             * Outer_L4_len + ... + Inner_L2_len for tunneling pkt.
             */
            uint64_t l3_len:9; /**< L3 (IP) Header Length. */
            uint64_t l4_len:8; /**< L4 (TCP/UDP) Header Length. */
            uint64_t tso_segsz:16; /**< TCP TSO segment size */
            /* fields for TX offloading of tunnels */
            uint64_t outer_l3_len:9; /**< Outer L3 (IP) Hdr Length. */
            uint64_t outer_l2_len:7; /**< Outer L2 (MAC) Hdr Length. */
            /* uint64_t unused:8; */
        };
    };
    /** Size of the application private data. In case of an indirect
     * mbuf, it stores the direct mbuf private data size. */
    uint16_t priv_size;
    /** Timesync flags for use with IEEE1588. */
    uint16_t timesync;
} __rte_cache_aligned;
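2. The mbuf structure in diagrams
The figure above shows an mbuf with a single segment.
rte_pktmbuf_mtod(m): returns the address of the first byte of data
headroom and tailroom
Length of the data: rte_pktmbuf_pkt_len(m) or rte_pktmbuf_data_len(m); the two are equal for a single-segment mbuf
In a multi-segment packet, the next field of each mbuf holds the address of the next segment.
The pkt_len of m is the sum of the data lengths of the three segments, seg1 + seg2 + seg3.
A newly allocated mbuf contains a single segment of length 0.
Freeing an mbuf returns it to its mempool.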
When freeing a packet mbuf that contains several segments, all of them are freed and returned to their original mempool.
In other words, each segment of a multi-segment packet is itself a struct rte_mbuf, and each one is released back to the mempool it came from.
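The relationship between the segment chain and pkt_len can be sketched in plain C. This is only a model: struct toy_mbuf, toy_chain3() and toy_chain_data_len() are hypothetical names that copy a few field names from struct rte_mbuf so the arithmetic can be checked without DPDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf: only the fields used here. */
struct toy_mbuf {
    uint16_t data_len;     /* bytes of data in this segment */
    uint32_t pkt_len;      /* total packet length, kept in the first segment */
    uint8_t nb_segs;       /* number of segments, kept in the first segment */
    struct toy_mbuf *next; /* next segment, NULL for the last one */
};

/* Walk the chain through next and add up the per-segment data lengths. */
static uint32_t toy_chain_data_len(const struct toy_mbuf *m)
{
    uint32_t total = 0;

    assert(m != NULL);
    for (; m != NULL; m = m->next)
        total += m->data_len;
    return total;
}

/* Link three segments into one packet and fill in the totals on the head. */
static void toy_chain3(struct toy_mbuf *s1, struct toy_mbuf *s2,
                       struct toy_mbuf *s3)
{
    assert(s1 != NULL && s2 != NULL && s3 != NULL);
    s1->next = s2;
    s2->next = s3;
    s3->next = NULL;
    s1->nb_segs = 3;
    s1->pkt_len = (uint32_t)s1->data_len + s2->data_len + s3->data_len;
}
```

With segments of 100, 200 and 50 bytes, the head's pkt_len becomes 350 while its own data_len stays 100.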
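3. mbuf and how zero-copy works
4. When an mbuf is freed
5. Basic mbuf operations
The layout of rte_mbuf resembles sk_buff in the Linux kernel network stack: headroom and tailroom are reserved before and after the memory that holds the packet, so that applications can easily encapsulate and decapsulate packets. The headroom defaults to 128 bytes and can be changed through the RTE_PKTMBUF_HEADROOM macro.
5.1 Computing the size of each region of an mbuf
With the struct shown in section 1, the headroom length is m->data_off (the distance from m->buf_addr to the start of the data), and the tailroom length is m->buf_len - m->data_off - m->data_len. Older DPDK releases wrote these fields as m->pkt.data and m->pkt.data_len. The following functions perform these calculations: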
uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
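A minimal sketch of the same arithmetic, assuming a simplified model rather than the real library: struct toy_mbuf, TOY_HEADROOM, toy_reset(), toy_headroom() and toy_tailroom() are hypothetical names, and only the field names (buf_addr, buf_len, data_off, data_len) come from struct rte_mbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HEADROOM 128 /* stand-in for RTE_PKTMBUF_HEADROOM */

/* Hypothetical model of the rte_mbuf fields that describe one buffer. */
struct toy_mbuf {
    void *buf_addr;    /* start of the segment buffer */
    uint16_t buf_len;  /* size of the segment buffer */
    uint16_t data_off; /* offset of the first data byte */
    uint16_t data_len; /* bytes of data in this segment */
};

/* Attach a buffer and place the (empty) data area right after the headroom. */
static void toy_reset(struct toy_mbuf *m, void *buf, uint16_t buf_len)
{
    assert(m != NULL && buf != NULL);
    m->buf_addr = buf;
    m->buf_len = buf_len;
    m->data_off = buf_len < TOY_HEADROOM ? buf_len : TOY_HEADROOM;
    m->data_len = 0;
}

/* Free bytes in front of the data. */
static uint16_t toy_headroom(const struct toy_mbuf *m)
{
    assert(m != NULL);
    return m->data_off;
}

/* Free bytes behind the data. */
static uint16_t toy_tailroom(const struct toy_mbuf *m)
{
    assert(m != NULL);
    return (uint16_t)(m->buf_len - toy_headroom(m) - m->data_len);
}
```

In this model headroom + data_len + tailroom always equals buf_len, which is a handy invariant to check when debugging offset arithmetic.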
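5.2 Stripping a packet header
Suppose the start of the data (m->buf_addr + m->data_off, written m->pkt.data in older DPDK releases) points at the L2 header. The following operations strip the 14-byte L2 header:
m->data_off += 14;
m->data_len -= 14;
m->pkt_len -= 14;
rte_pktmbuf_adj() already implements these operations; its prototype is: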
char *rte_pktmbuf_adj(struct rte_mbuf *m, uint16_t len)
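The header-stripping steps can be modelled as below. This is a sketch under stated assumptions, not the library's code: struct toy_mbuf and toy_adj() are hypothetical, and the bounds check reflects the documented behaviour that rte_pktmbuf_adj() returns NULL when the requested length exceeds the data in the segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: field names follow struct rte_mbuf. */
struct toy_mbuf {
    void *buf_addr;    /* start of the segment buffer */
    uint16_t data_off; /* offset of the first data byte */
    uint16_t data_len; /* bytes of data in this segment */
    uint32_t pkt_len;  /* total packet length */
};

/*
 * Remove len bytes from the front of the data.
 * Returns the new data start, or NULL (mbuf untouched) if the segment
 * holds fewer than len bytes.
 */
static char *toy_adj(struct toy_mbuf *m, uint16_t len)
{
    assert(m != NULL);
    if (len > m->data_len)
        return NULL;
    m->data_off = (uint16_t)(m->data_off + len);
    m->data_len = (uint16_t)(m->data_len - len);
    m->pkt_len -= len;
    return (char *)m->buf_addr + m->data_off;
}
```

Calling toy_adj(&m, 14) on a 60-byte frame leaves 46 bytes of data and returns a pointer to what was byte 14 of the frame, i.e. the start of the L3 header.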
5.3 Adding a packet header
The following operations prepend a 14-byte L2 header to an IP packet:
m->data_off -= 14;
m->data_len += 14;
m->pkt_len += 14;
rte_pktmbuf_prepend() implements these operations; its prototype is:
char *rte_pktmbuf_prepend(struct rte_mbuf *m, uint16_t len)
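As an illustration of why the return value matters, here is a sketch that prepends an Ethernet header in a simplified model. struct toy_mbuf, toy_prepend(), toy_push_ether() and TOY_ETHER_HDR_LEN are hypothetical names; the NULL return on insufficient headroom mirrors the documented behaviour of rte_pktmbuf_prepend().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETHER_HDR_LEN 14 /* 6 dst + 6 src + 2 EtherType */

/* Hypothetical model: field names follow struct rte_mbuf. */
struct toy_mbuf {
    void *buf_addr;    /* start of the segment buffer */
    uint16_t data_off; /* offset of the first data byte (= headroom) */
    uint16_t data_len; /* bytes of data in this segment */
    uint32_t pkt_len;  /* total packet length */
};

/* Grow the data area by len bytes at the front; NULL if headroom is short. */
static char *toy_prepend(struct toy_mbuf *m, uint16_t len)
{
    assert(m != NULL);
    if (len > m->data_off)
        return NULL;
    m->data_off = (uint16_t)(m->data_off - len);
    m->data_len = (uint16_t)(m->data_len + len);
    m->pkt_len += len;
    return (char *)m->buf_addr + m->data_off;
}

/* Write an Ethernet header in front of the current data. */
static int toy_push_ether(struct toy_mbuf *m, const uint8_t dst[6],
                          const uint8_t src[6], uint16_t ether_type)
{
    uint8_t *hdr = (uint8_t *)toy_prepend(m, TOY_ETHER_HDR_LEN);

    if (hdr == NULL)
        return -1; /* not enough headroom */
    memcpy(hdr, dst, 6);
    memcpy(hdr + 6, src, 6);
    hdr[12] = (uint8_t)(ether_type >> 8); /* network byte order */
    hdr[13] = (uint8_t)(ether_type & 0xff);
    return 0;
}
```

The payload itself is never moved: only data_off changes, and the new header is written into what used to be headroom.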
5.4 Appending data in the tailroom
To add N bytes of data in the tailroom:
tail = (char *)m->buf_addr + m->data_off + m->data_len; // tail is the first byte of the tailroom
m->data_len += N;
m->pkt_len += N;
rte_pktmbuf_append() implements these operations; its prototype is:
char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
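A sketch of the append steps for the single-segment case, with the tailroom check made explicit. struct toy_mbuf and toy_append() are hypothetical; the real rte_pktmbuf_append() works on the last segment of a chain, which this model does not attempt to reproduce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-segment model: field names follow struct rte_mbuf. */
struct toy_mbuf {
    void *buf_addr;    /* start of the segment buffer */
    uint16_t buf_len;  /* size of the segment buffer */
    uint16_t data_off; /* offset of the first data byte */
    uint16_t data_len; /* bytes of data in this segment */
    uint32_t pkt_len;  /* total packet length */
};

/*
 * Grow the data area by len bytes at the back.
 * Returns the old tail (where the caller writes the new bytes),
 * or NULL if the tailroom is smaller than len.
 */
static char *toy_append(struct toy_mbuf *m, uint16_t len)
{
    uint16_t tailroom;
    char *tail;

    assert(m != NULL);
    tailroom = (uint16_t)(m->buf_len - m->data_off - m->data_len);
    if (len > tailroom)
        return NULL;
    tail = (char *)m->buf_addr + m->data_off + m->data_len;
    m->data_len = (uint16_t)(m->data_len + len);
    m->pkt_len += len;
    return tail;
}
```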
5.5 Removing data from the end
librte_mbuf also provides rte_pktmbuf_trim(), which removes the last N bytes of the data area of an mbuf. It does the following:
m->data_len -= N;
m->pkt_len -= N;
Its prototype is:
int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
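The trim steps, again as a hedged single-segment model: struct toy_mbuf and toy_trim() are hypothetical, and the 0 / -1 return convention follows the documented behaviour of rte_pktmbuf_trim(). Note that neither buf_len nor data_off changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-segment model: field names follow struct rte_mbuf. */
struct toy_mbuf {
    uint16_t buf_len;  /* size of the segment buffer, never changed by trim */
    uint16_t data_off; /* offset of the first data byte, unchanged by trim */
    uint16_t data_len; /* bytes of data in this segment */
    uint32_t pkt_len;  /* total packet length */
};

/* Drop the last len bytes of data; 0 on success, -1 if len > data_len. */
static int toy_trim(struct toy_mbuf *m, uint16_t len)
{
    assert(m != NULL);
    if (len > m->data_len)
        return -1;
    m->data_len = (uint16_t)(m->data_len - len);
    m->pkt_len -= len;
    return 0;
}
```

Trimming 4 bytes from a 64-byte frame, for example, leaves 60 bytes of data and simply enlarges the tailroom by 4.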
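Key points:
(1) Packet data always lives in the data area, which is controlled mainly by data_off and data_len:
buf_addr + data_off = start address of the data
buf_addr + data_off + data_len = end address of the data
(2) The relationship between pkt_len and data_len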
uint32_t pkt_len;/**< Total pkt len: sum of all segments. */
uint16_t data_len;/**< Amount of data in segment buffer. */
If the packet consists of a single mbuf, pkt_len and data_len have the same value.
(3) When to use each API
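rte_pktmbuf_prepend
Moves data_off backward into the headroom to grow the packet at the front. Check the return value: it is NULL when the headroom is too small for the requested length. Use it when a packet travels down from the application layer and is encapsulated layer by layer.
rte_pktmbuf_append
Increases data_len to grow the packet at the back, and returns the tail address from before the change.
Use it, for example, when the header is already in place and the payload is filled in afterwards.
rte_pktmbuf_adj
Shrinks the packet at the front by increasing data_off. Use it, for example, to remove the L2 header when forwarding from layer 2 to layer 3.
rte_pktmbuf_trim
Shrinks the packet at the back by reducing data_len and pkt_len; buf_len is unchanged. Use it when more space was reserved than the data actually needs.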
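Summary:
These four APIs are the usual way to resize the data area. Their names and usage mirror the Linux kernel sk_buff helpers: rte_pktmbuf_prepend, rte_pktmbuf_append, rte_pktmbuf_adj and rte_pktmbuf_trim correspond to skb_push, skb_put, skb_pull and skb_trim.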
rte_pktmbuf_mtod
rte_pktmbuf_mtod_offset
#define rte_pktmbuf_mtod_offset(m, t, o) \
((t)((char *)(m)->buf_addr + (m)->data_off + (o)))
#define rte_pktmbuf_mtod(m, t) rte_pktmbuf_mtod_offset(m, t, 0)
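Both macros simply compute buf_addr + data_off + the caller's offset and cast the result to the requested type.
Build your knowledge on solid foundations and extend it step by step. The DPDK source code is well worth studying.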
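A small usage sketch of the same pointer arithmetic. TOY_MTOD_OFFSET and TOY_MTOD copy the macro bodies shown above under different names so the example compiles without DPDK; struct toy_mbuf and toy_ether_type() are hypothetical, and the offset 12 assumes the data starts at an untagged Ethernet header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: field names follow struct rte_mbuf. */
struct toy_mbuf {
    void *buf_addr;    /* start of the segment buffer */
    uint16_t data_off; /* offset of the first data byte */
};

/* Same arithmetic as rte_pktmbuf_mtod_offset() / rte_pktmbuf_mtod(). */
#define TOY_MTOD_OFFSET(m, t, o) \
    ((t)((char *)(m)->buf_addr + (m)->data_off + (o)))
#define TOY_MTOD(m, t) TOY_MTOD_OFFSET(m, t, 0)

/* Read the EtherType, stored big-endian at byte 12 of an Ethernet header. */
static uint16_t toy_ether_type(const struct toy_mbuf *m)
{
    const uint8_t *p;

    assert(m != NULL);
    p = TOY_MTOD_OFFSET(m, const uint8_t *, 12);
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

Reading the field byte by byte avoids any assumption about the alignment of the data start or the byte order of the host.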