Learning DPDK: the mbuf structure

Date: 2020-11-29 22:28:02

References:
http://blog.csdn.net/hejin_some/article/details/72473031
http://blog.csdn.net/bestboyxie/article/details/52984397
http://dpdk.org/doc/guides/prog_guide/mbuf_lib.html#mbuf-library
http://www.cnblogs.com/yhp-smarthome/p/6687175.html
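This post is not really original work: it combines a translation of the official documentation, excerpts from other blog posts, and a little of my own hands-on experience.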

0. Introduction to Direct and Indirect Buffers: http://dpdk.org/doc/guides/prog_guide/mbuf_lib.html#mbuf-library

1. The core mbuf structure

struct rte_mbuf {
MARKER cacheline0;

void *buf_addr; /**< Virtual address of segment buffer. */
phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */

uint16_t buf_len; /**< Length of segment buffer. */

/* next 6 bytes are initialised on RX descriptor rearm */
MARKER8 rearm_data;
uint16_t data_off;

/**
* 16-bit Reference counter.
* It should only be accessed using the following functions:
* rte_mbuf_refcnt_update(), rte_mbuf_refcnt_read(), and
* rte_mbuf_refcnt_set(). The functionality of these functions (atomic,
* or non-atomic) is controlled by the CONFIG_RTE_MBUF_REFCNT_ATOMIC
* config option.
*/

RTE_STD_C11
union {
rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
uint16_t refcnt; /**< Non-atomically accessed refcnt */
};
uint8_t nb_segs; /**< Number of segments. */
uint8_t port; /**< Input port. */

uint64_t ol_flags; /**< Offload features. */

/* remaining bytes are set on RX when pulling packet from descriptor */
MARKER rx_descriptor_fields1;

/*
* The packet type, which is the combination of outer/inner L2, L3, L4
* and tunnel types. The packet_type is about data really present in the
* mbuf. Example: if vlan stripping is enabled, a received vlan packet
* would have RTE_PTYPE_L2_ETHER and not RTE_PTYPE_L2_VLAN because the
* vlan is stripped from the data.
*/

RTE_STD_C11
union {
uint32_t packet_type; /**< L2/L3/L4 and tunnel information. */
struct {
uint32_t l2_type:4; /**< (Outer) L2 type. */
uint32_t l3_type:4; /**< (Outer) L3 type. */
uint32_t l4_type:4; /**< (Outer) L4 type. */
uint32_t tun_type:4; /**< Tunnel type. */
uint32_t inner_l2_type:4; /**< Inner L2 type. */
uint32_t inner_l3_type:4; /**< Inner L3 type. */
uint32_t inner_l4_type:4; /**< Inner L4 type. */
};
};

uint32_t pkt_len; /**< Total pkt len: sum of all segments. */
uint16_t data_len; /**< Amount of data in segment buffer. */
/** VLAN TCI (CPU order), valid if PKT_RX_VLAN_STRIPPED is set. */
uint16_t vlan_tci;

union {
uint32_t rss; /**< RSS hash result if RSS enabled */
struct {
RTE_STD_C11
union {
struct {
uint16_t hash;
uint16_t id;
};
uint32_t lo;
/**< Second 4 flexible bytes */
};
uint32_t hi;
/**< First 4 flexible bytes or FD ID, dependent on
PKT_RX_FDIR_* flag in ol_flags. */

} fdir; /**< Filter identifier if FDIR enabled */
struct {
uint32_t lo;
uint32_t hi;
} sched; /**< Hierarchical scheduler */
uint32_t usr; /**< User defined tags. See rte_distributor_process() */
} hash; /**< hash information */

uint32_t seqn; /**< Sequence number. See also rte_reorder_insert() */

/** Outer VLAN TCI (CPU order), valid if PKT_RX_QINQ_STRIPPED is set. */
uint16_t vlan_tci_outer;

/* second cache line - fields only used in slow path or on TX */
MARKER cacheline1 __rte_cache_min_aligned;

RTE_STD_C11
union {
void *userdata; /**< Can be used for external metadata */
uint64_t udata64; /**< Allow 8-byte userdata on 32-bit */
};

struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
struct rte_mbuf *next; /**< Next segment of scattered packet. */

/* fields to support TX offloads */
RTE_STD_C11
union {
uint64_t tx_offload; /**< combined for easy fetch */
__extension__
struct {
uint64_t l2_len:7;
/**< L2 (MAC) Header Length for non-tunneling pkt.
* Outer_L4_len + ... + Inner_L2_len for tunneling pkt.
*/

uint64_t l3_len:9; /**< L3 (IP) Header Length. */
uint64_t l4_len:8; /**< L4 (TCP/UDP) Header Length. */
uint64_t tso_segsz:16; /**< TCP TSO segment size */

/* fields for TX offloading of tunnels */
uint64_t outer_l3_len:9; /**< Outer L3 (IP) Hdr Length. */
uint64_t outer_l2_len:7; /**< Outer L2 (MAC) Hdr Length. */

/* uint64_t unused:8; */
};
};

/** Size of the application private data. In case of an indirect
* mbuf, it stores the direct mbuf private data size. */

uint16_t priv_size;

/** Timesync flags for use with IEEE1588. */
uint16_t timesync;
} __rte_cache_aligned;

2. The mbuf structure in diagrams

[Figure: an mbuf with a single segment]
The figure above shows an mbuf that has only one segment.

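rte_pktmbuf_mtod(m, t): returns a pointer to the start of the data, cast to type t
headroom and tailroom: the unused space before and after the data within the buffer
Length of the data: rte_pktmbuf_pkt_len(m) or rte_pktmbuf_data_len(m)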

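[Figure: an mbuf chain with three segments]
The next field of each mbuf holds the address of the next segment.
The total packet length of m (pkt_len) is the sum of the data in the three segments seg1, seg2 and seg3.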
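
The relationship between pkt_len and the per-segment data_len can be sketched with a small stand-alone model. The toy_mbuf struct and the toy_chain_len() and toy_check_chain() helpers are hypothetical names used only for illustration; they are not part of DPDK, and only the field names mirror struct rte_mbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf: only the fields needed here. */
struct toy_mbuf {
    uint16_t data_len;      /* bytes of data held in this segment */
    uint32_t pkt_len;       /* total packet length, kept in the first segment */
    struct toy_mbuf *next;  /* next segment, NULL at the end of the chain */
};

/* Walk the chain and add up the per-segment data_len values. */
static uint32_t toy_chain_len(const struct toy_mbuf *m)
{
    uint32_t total = 0;

    for (; m != NULL; m = m->next)
        total += m->data_len;
    return total;
}

/* Debug check: the head's pkt_len must equal the sum over all segments. */
static void toy_check_chain(const struct toy_mbuf *head)
{
    assert(head != NULL);
    assert(head->pkt_len == toy_chain_len(head));
}
```

For a chain whose segments hold 100, 200 and 300 bytes, toy_chain_len() on the head yields 600, which is the value the head's pkt_len is expected to carry.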

A newly allocated mbuf consists of a single segment with length 0.
Freeing an mbuf returns it to its memory pool for reuse.

When freeing a packet mbuf that contains several segments, all of them are freed and returned to their original mempool. (Each segment is itself a struct rte_mbuf.)

3. mbuf and how zero-copy is implemented

4. When an mbuf is freed

5. Basic mbuf operations

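The layout of rte_mbuf resembles sk_buff in the Linux kernel network stack: headroom and tailroom are reserved before and after the memory that holds the packet, so that applications can add or strip headers without moving the payload. The headroom defaults to 128 bytes and can be changed through the RTE_PKTMBUF_HEADROOM macro.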

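5.1 Computing the headroom and tailroom lengths
The headroom length is the data start address minus m->buf_addr, which is simply m->data_off; the tailroom length is m->buf_len - m->data_len - headroom. (Older DPDK releases kept these fields in a nested struct, for example m->pkt.data and m->pkt.data_len; the struct shown above uses data_off, data_len and pkt_len directly.) Both calculations are implemented by the following functions: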

uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
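
As a minimal sketch of those two calculations, the stand-alone model below uses a hypothetical toy_mbuf struct (not DPDK's type) that carries only the three fields involved. toy_headroom() and toy_tailroom() are illustrative names, not library functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf: only the fields needed here. */
struct toy_mbuf {
    uint16_t buf_len;   /* size of the whole segment buffer */
    uint16_t data_off;  /* offset of the data from the start of the buffer */
    uint16_t data_len;  /* bytes of data in the buffer */
};

/* Headroom is everything in front of the data, i.e. data_off itself. */
static uint16_t toy_headroom(const struct toy_mbuf *m)
{
    return m->data_off;
}

/* Tailroom is what remains after the headroom and the data. */
static uint16_t toy_tailroom(const struct toy_mbuf *m)
{
    assert(m->data_off + m->data_len <= m->buf_len);
    return (uint16_t)(m->buf_len - m->data_off - m->data_len);
}
```

With buf_len = 2176, data_off = 128 and data_len = 64, the headroom is 128 bytes and the tailroom is 2176 - 128 - 64 = 1984 bytes.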

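5.2 Stripping a packet header
Assuming the data currently starts at the packet's layer-2 header, the following operations strip that header (14 bytes for Ethernet). The field names follow the struct shown above; older releases used m->pkt.data, m->pkt.data_len and m->pkt.pkt_len instead.

m->data_off += 14;
m->data_len -= 14;
m->pkt_len -= 14;

These operations are already implemented by rte_pktmbuf_adj(), whose prototype is: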
char *rte_pktmbuf_adj(struct rte_mbuf *m, uint16_t len)
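
The bookkeeping behind stripping a header can be sketched on a single-segment model. toy_mbuf and toy_adj() are hypothetical names for illustration only; this is not the library source, and the bounds check is an assumption about reasonable behaviour that happens to match the documented NULL return of rte_pktmbuf_adj().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-segment stand-in for struct rte_mbuf. */
struct toy_mbuf {
    char    *buf_addr;  /* start of the segment buffer */
    uint16_t buf_len;   /* size of the segment buffer */
    uint16_t data_off;  /* offset of the data inside the buffer */
    uint16_t data_len;  /* bytes of data in this segment */
    uint32_t pkt_len;   /* total packet length */
};

/* Remove len bytes from the front of the data.
 * Returns the new data start, or NULL if the segment holds fewer bytes. */
static char *toy_adj(struct toy_mbuf *m, uint16_t len)
{
    if (len > m->data_len)
        return NULL;
    m->data_off = (uint16_t)(m->data_off + len);
    m->data_len = (uint16_t)(m->data_len - len);
    m->pkt_len -= len;
    assert(m->data_off + m->data_len <= m->buf_len);
    return m->buf_addr + m->data_off;
}
```

Starting from data_off = 128 and data_len = pkt_len = 60, toy_adj(&m, 14) leaves data_off = 142 and data_len = pkt_len = 46. The headroom grows by the 14 bytes that were stripped.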

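5.3 Adding a packet header
The following operations make room for a layer-2 header in front of an IP packet (field names as in the struct shown above):

m->data_off -= 14;
m->data_len += 14;
m->pkt_len += 14;

These operations are implemented by rte_pktmbuf_prepend(), whose prototype is: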
char *rte_pktmbuf_prepend(struct rte_mbuf *m, uint16_t len)
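
A minimal sketch of the prepend bookkeeping, including the headroom check that makes the NULL return value worth testing. toy_mbuf and toy_prepend() are hypothetical names on a single-segment model, not DPDK code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-segment stand-in for struct rte_mbuf. */
struct toy_mbuf {
    char    *buf_addr;  /* start of the segment buffer */
    uint16_t buf_len;   /* size of the segment buffer */
    uint16_t data_off;  /* offset of the data inside the buffer */
    uint16_t data_len;  /* bytes of data in this segment */
    uint32_t pkt_len;   /* total packet length */
};

/* Grow the data area by len bytes toward the front of the buffer.
 * Returns the new data start, or NULL if the headroom is too small. */
static char *toy_prepend(struct toy_mbuf *m, uint16_t len)
{
    if (len > m->data_off)  /* the headroom is exactly data_off bytes */
        return NULL;
    m->data_off = (uint16_t)(m->data_off - len);
    m->data_len = (uint16_t)(m->data_len + len);
    m->pkt_len += len;
    assert(m->data_off + m->data_len <= m->buf_len);
    return m->buf_addr + m->data_off;
}
```

The caller would then write the new header at the returned address, for example filling in 14 bytes of Ethernet header in front of an IP packet.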

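5.4 Appending data in the tailroom
To add N bytes of data in the tailroom, the following operations are enough (field names as in the struct shown above):
tail = (char *)m->buf_addr + m->data_off + m->data_len; // tail is the first byte of the tailroom
m->data_len += N;
m->pkt_len += N;

These operations are implemented by rte_pktmbuf_append(), whose prototype is: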
char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
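
The append bookkeeping can be sketched the same way. toy_mbuf and toy_append() are hypothetical names, and the model handles a single segment only, whereas rte_pktmbuf_append() works on the last segment of a chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-segment stand-in for struct rte_mbuf. */
struct toy_mbuf {
    char    *buf_addr;  /* start of the segment buffer */
    uint16_t buf_len;   /* size of the segment buffer */
    uint16_t data_off;  /* offset of the data inside the buffer */
    uint16_t data_len;  /* bytes of data in this segment */
    uint32_t pkt_len;   /* total packet length */
};

/* Grow the data area by len bytes toward the end of the buffer.
 * Returns the old end of data (where the caller writes the new bytes),
 * or NULL if the tailroom is too small. */
static char *toy_append(struct toy_mbuf *m, uint16_t len)
{
    uint16_t tailroom;
    char *tail;

    assert(m->data_off + m->data_len <= m->buf_len);
    tailroom = (uint16_t)(m->buf_len - m->data_off - m->data_len);
    if (len > tailroom)
        return NULL;
    tail = m->buf_addr + m->data_off + m->data_len;
    m->data_len = (uint16_t)(m->data_len + len);
    m->pkt_len += len;
    return tail;
}
```

Note that data_off does not move: only the end of the data advances, so the returned pointer is the address that used to be the first byte of the tailroom.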

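5.5 Removing data from the end
librte_mbuf also provides rte_pktmbuf_trim(), which removes the last N bytes of the data area of an mbuf. In essence it does the following (field names as in the struct shown above):

m->data_len -= N;
m->pkt_len -= N;

Its prototype is: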
int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
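
A minimal sketch of the trim bookkeeping on a single-segment model. toy_mbuf and toy_trim() are hypothetical names, not DPDK code. The return convention (0 on success, -1 when the segment holds fewer than len bytes) follows the int return type of the prototype above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-segment stand-in for struct rte_mbuf. */
struct toy_mbuf {
    uint16_t buf_len;   /* size of the segment buffer, never changed here */
    uint16_t data_len;  /* bytes of data in this segment */
    uint32_t pkt_len;   /* total packet length */
};

/* Drop the last len bytes of data.
 * Returns 0 on success, -1 if the segment holds fewer than len bytes. */
static int toy_trim(struct toy_mbuf *m, uint16_t len)
{
    assert(m->pkt_len >= m->data_len);
    if (len > m->data_len)
        return -1;
    m->data_len = (uint16_t)(m->data_len - len);
    m->pkt_len -= len;
    return 0;
}
```

Only data_len and pkt_len shrink; buf_len describes the buffer itself and stays the same, so the trimmed bytes simply become tailroom.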

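Key points:
(1) Packet data always lives in the data area; the two fields that control it are data_off and data_len:
buf_addr + data_off = start address of the data
buf_addr + data_off + data_len = end address of the data

(2) The relationship between pkt_len and data_len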

uint32_t pkt_len;/**< Total pkt len: sum of all segments. */
uint16_t data_len;/**< Amount of data in segment buffer. */

If the packet consists of a single mbuf, pkt_len and data_len have the same value.

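(3) When to use each API

rte_pktmbuf_prepend
Moves data_off backward (grows the packet toward the front). Note: check the return value, because it returns NULL when len exceeds the available headroom. Use it when a packet travels down from the application layer and each layer adds its own header.

rte_pktmbuf_append
Increases data_len and returns the tail address from before the change (grows the packet toward the back).
Use it, for example, when the header is already in place and the payload is filled in afterwards.

rte_pktmbuf_adj

(Shrinks from the front.) Increases data_off. When forwarding from layer 2 to layer 3, use it to strip the layer-2 header.

rte_pktmbuf_trim
(Shrinks from the back.) Decreases data_len and pkt_len; buf_len is unchanged. Use it when more space was reserved than the data ended up needing.

Summary:
These four APIs are the usual ways to resize the data area; their names and usage resemble the kernel's sk_buff helpers.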

rte_pktmbuf_mtod
rte_pktmbuf_mtod_offset

#define rte_pktmbuf_mtod_offset(m, t, o) \
((t)((char *)(m)->buf_addr + (m)->data_off + (o)))

#define rte_pktmbuf_mtod(m, t) rte_pktmbuf_mtod_offset(m, t, 0)

These two macros simply return buf_addr + data_off + the caller-supplied offset, cast to the requested type.
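
To see the pointer arithmetic in isolation, the sketch below copies the shape of the two macros onto a hypothetical toy_mbuf struct. TOY_MTOD_OFFSET, TOY_MTOD and toy_byte_at() are illustrative names, not DPDK identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf: only the fields needed here. */
struct toy_mbuf {
    void    *buf_addr;  /* start of the segment buffer */
    uint16_t data_off;  /* offset of the data inside the buffer */
};

/* Same shape as the macros quoted above, on the toy struct. */
#define TOY_MTOD_OFFSET(m, t, o) \
    ((t)((char *)(m)->buf_addr + (m)->data_off + (o)))
#define TOY_MTOD(m, t) TOY_MTOD_OFFSET(m, t, 0)

/* Read one byte located off bytes past the start of the data. */
static uint8_t toy_byte_at(const struct toy_mbuf *m, uint16_t off)
{
    assert(m->buf_addr != NULL);
    return *TOY_MTOD_OFFSET(m, const uint8_t *, off);
}
```

The offset is counted from the start of the data, not from the start of the buffer. The macro performs no bounds check, so the caller has to make sure the offset stays within data_len.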

Learn things thoroughly and build up your knowledge step by step. The DPDK source code is well worth studying.