L3---fragment: layer 3 fragmentation / reassembly mechanism

#
# This doc describes:
#
# _regular_ L3 (ipv4, ipv6) fragment handling # including: "fragmentation" - frag, "defragmentation" - defrag
#
# And the _special_ handling of L3 fragments by various networking modules. # like netfilter, ipsec
#

===============================================================================================================

--- index:

---------------------------------------------------------------------------------------------------------------

=> terms: "original packet", "fragment packet", "fragmentation" - frag, "defragmentation" - defrag

=> <quote> from ipv6 rfc <<rfc2460---Internet.Protocol.Version.6.(IPv6).Specification.txt>> / 4.5 Fragment Header

---------------------------------------------------------------------------------------------------------------

=> regular L3 fragment handling

=> frag # fast path / slow path

=> fast path

=> slow path

=> an ipv4 router might further frag a forwarded ipv4 fragment, but an ipv6 router will not.

=> retransmission # always an "original packet", so for all "fragment packet"s of an "original packet", not for a single "fragment packet"

------------------------------------------------------

=> defrag

=> defrag logic # from local delivery path.

=> reassembly magic # new ingress "fragment packet", _becomes_ "original packet"

---------------------------------------------------------------------------------------------------------------

=> netfilter: fragment handling # <symlink> to: <<L3---fragment---netfilter.txt>>

---------------------------------------------------------------------------------------------------------------

=> ipsec: fragment handling # <symlink> to: <<ipsec---fragment.txt>>

---------------------------------------------------------------------------------------------------------------

===============================================================================================================

@@ terms: "original packet", "fragment packet", "fragmentation" - frag, "defragmentation" - defrag

#
# These terms are defined by the IPv6 RFC.
#
#
# The same definitions and rules also apply to IPv4 # --- the IPv6 RFC description is simply clearer.
#

===============================================================================================================

@@-@ <quote> from ipv6 rfc <<rfc2460---Internet.Protocol.Version.6.(IPv6).Specification.txt>> / 4.5 Fragment Header

The initial, large, unfragmented packet is referred to as the
"original packet", and it is considered to consist of two parts, as
illustrated:

original packet:

   +------------------+----------------------//-----------------------+
   |  Unfragmentable  |                 Fragmentable                  |
   |       Part       |                     Part                      |
   +------------------+----------------------//-----------------------+

The Unfragmentable Part consists of the IPv6 header plus any
extension headers that must be processed by nodes en route to the
destination, that is, all headers up to and including the Routing
header if present, else the Hop-by-Hop Options header if present,
else no extension headers.

The Fragmentable Part consists of the rest of the packet, that is,
any extension headers that need be processed only by the final
destination node(s), plus the upper-layer header and data.

The Fragmentable Part of the original packet is divided into
fragments, each, except possibly the last ("rightmost") one, being an
integer multiple of 8 octets long. The fragments are transmitted in
separate "fragment packets" as illustrated:

original packet:

   +------------------+--------------+--------------+--//--+----------+
   |  Unfragmentable  |    first     |    second    |      |   last   |
   |       Part       |   fragment   |   fragment   | .... | fragment |
   +------------------+--------------+--------------+--//--+----------+

fragment packets:

   +------------------+--------+--------------+
   |  Unfragmentable  |Fragment|    first     |
   |       Part       | Header |   fragment   |
   +------------------+--------+--------------+

   +------------------+--------+--------------+
   |  Unfragmentable  |Fragment|    second    |
   |       Part       | Header |   fragment   |
   +------------------+--------+--------------+
                         o
                         o
                         o
   +------------------+--------+----------+
   |  Unfragmentable  |Fragment|   last   |
   |       Part       | Header | fragment |
   +------------------+--------+----------+

Each fragment packet is composed of:

(1) The Unfragmentable Part of the original packet, with the
Payload Length of the original IPv6 header changed to contain
the length of this fragment packet only (excluding the length
of the IPv6 header itself), and the Next Header field of the
last header of the Unfragmentable Part changed to 44.

(2) A Fragment header containing:

The Next Header value that identifies the first header of
the Fragmentable Part of the original packet.

A Fragment Offset containing the offset of the fragment,
in 8-octet units, relative to the start of the
Fragmentable Part of the original packet. The Fragment
Offset of the first ("leftmost") fragment is 0.

An M flag value of 0 if the fragment is the last
("rightmost") one, else an M flag value of 1.

The Identification value generated for the original
packet.

(3) The fragment itself.

The lengths of the fragments must be chosen such that the resulting
fragment packets fit within the MTU of the path to the packets'
destination(s).

At the destination, fragment packets are reassembled into their
original, unfragmented form, as illustrated:

reassembled original packet:

   +------------------+----------------------//------------------------+
   |  Unfragmentable  |                  Fragmentable                  |
   |       Part       |                      Part                      |
   +------------------+----------------------//------------------------+

The following rules govern reassembly:

An original packet is reassembled only from fragment packets that
have the same Source Address, Destination Address, and Fragment
Identification.

The Unfragmentable Part of the reassembled packet consists of all
headers up to, but not including, the Fragment header of the first
fragment packet (that is, the packet whose Fragment Offset is
zero), with the following two changes:

The Next Header field of the last header of the Unfragmentable
Part is obtained from the Next Header field of the first
fragment's Fragment header.

The Payload Length of the reassembled packet is computed from
the length of the Unfragmentable Part and the length and offset
of the last fragment. For example, a formula for computing the
Payload Length of the reassembled original packet is:

PL.orig = PL.first - FL.first - 8 + (8 * FO.last) + FL.last

where
PL.orig = Payload Length field of reassembled packet.
PL.first = Payload Length field of first fragment packet.
FL.first = length of fragment following Fragment header of
first fragment packet.
FO.last = Fragment Offset field of Fragment header of
last fragment packet.
FL.last = length of fragment following Fragment header of
last fragment packet.

The Fragmentable Part of the reassembled packet is constructed
from the fragments following the Fragment headers in each of the
fragment packets. The length of each fragment is computed by
subtracting from the packet's Payload Length the length of the
headers between the IPv6 header and fragment itself; its relative
position in Fragmentable Part is computed from its Fragment Offset
value.

The Fragment header is not present in the final, reassembled
packet.

The following error conditions may arise when reassembling fragmented
packets:

If insufficient fragments are received to complete reassembly of a
packet within 60 seconds of the reception of the first-arriving
fragment of that packet, reassembly of that packet must be
abandoned and all the fragments that have been received for that
packet must be discarded. If the first fragment (i.e., the one
with a Fragment Offset of zero) has been received, an ICMP Time
Exceeded -- Fragment Reassembly Time Exceeded message should be
sent to the source of that fragment.

If the length of a fragment, as derived from the fragment packet's
Payload Length field, is not a multiple of 8 octets and the M flag
of that fragment is 1, then that fragment must be discarded and an
ICMP Parameter Problem, Code 0, message should be sent to the
source of the fragment, pointing to the Payload Length field of
the fragment packet.

If the length and offset of a fragment are such that the Payload
Length of the packet reassembled from that fragment would exceed
65,535 octets, then that fragment must be discarded and an ICMP
Parameter Problem, Code 0, message should be sent to the source of
the fragment, pointing to the Fragment Offset field of the
fragment packet.

The following conditions are not expected to occur, but are not
considered errors if they do:

The number and content of the headers preceding the Fragment
header of different fragments of the same original packet may
differ. Whatever headers are present, preceding the Fragment
header in each fragment packet, are processed when the packets
arrive, prior to queueing the fragments for reassembly. Only
those headers in the Offset zero fragment packet are retained in
the reassembled packet.

The Next Header values in the Fragment headers of different
fragments of the same original packet may differ. Only the value
from the Offset zero fragment packet is used for reassembly.
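
--- sketch: the Payload Length formula and the two length checks quoted above, written out as C. This is only an illustration under assumptions: "struct frag_info" and all function names here are hypothetical and do not exist in the kernel, and the length of the extension headers in the Unfragmentable Part is assumed to be known by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-fragment summary for illustration; not a kernel structure. */
struct frag_info {
    uint32_t payload_len; /* PL: Payload Length field of this fragment packet */
    uint32_t frag_len;    /* FL: length of the fragment following the Fragment header */
    uint32_t frag_off;    /* FO: Fragment Offset field, in 8-octet units */
    bool     more;        /* M flag: true unless this is the last fragment */
};

/* PL.orig = PL.first - FL.first - 8 + (8 * FO.last) + FL.last */
uint32_t reassembled_payload_len(const struct frag_info *first,
                                 const struct frag_info *last)
{
    assert(first->frag_off == 0); /* "first" must be the Offset zero fragment */
    return first->payload_len - first->frag_len - 8
           + 8 * last->frag_off + last->frag_len;
}

/* Every fragment except the last must be a multiple of 8 octets long. */
bool frag_len_valid(const struct frag_info *f)
{
    return !f->more || (f->frag_len % 8) == 0;
}

/*
 * The reassembled Payload Length must not exceed 65,535 octets.
 * unfrag_ext_len: bytes of extension headers in the Unfragmentable Part
 * (an input assumed to be known by the caller).
 */
bool frag_fits_65535(uint32_t unfrag_ext_len, const struct frag_info *f)
{
    return unfrag_ext_len + 8 * f->frag_off + f->frag_len <= 65535u;
}
```

For instance, with no extension headers, a first fragment carrying 1448 bytes (PL = 1456) and a last fragment at offset 181 carrying 100 bytes give PL.orig = 1456 - 1448 - 8 + 8 * 181 + 100 = 1548.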

===============================================================================================================

@@ regular L3 fragment handling

#
# By "regular L3 fragment handling", we mean handling with no other special networking module involved, such as netfilter, ipsec, etc.
#
#
# In the linux implementation, ipv4 and ipv6 fragmentation / defragmentation are quite similar.
#
# --- For defragmentation, there is even a common $(defrag) framework.
#
#
# The following description is common to ipv4 and ipv6. # for more detailed and specific descriptions, see the references.
#

===============================================================================================================

@@-@ frag # fast path / slow path

function:

ipv4: int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))

ipv6: int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))

===============================================================================================================

@@-@-@ fast path

The upper L4 layer is aware that L3 may need to fragment, and tries to help ease the L3 fragmentation effort.

So when L4 pushes egress data into its pending egress "original packet", it tries to organize the
"original packet" skb in a format that suits quick fragmentation. The format
is like this:

sk_buff: # its data area is like below

    | main data space |     # containing L4 header, and a small portion of L4 payload.
                            #
                            # The valid content size of the main data space is at most PMTU ( kept in socket )

    | children page fragments | = NULL     # "skb_shared_info->frags" is empty, no page fragments

    | children skb fragments | -> child skb #1 -> child skb #2 -> ... -> child skb #N

        #
        # "skb_shared_info->frag_list" != NULL
        #
        # Each child skb fragment contains a further portion of the L4 payload.
        #
        # --- normally, each child skb fragment is a linear skb, with only a main data space,
        #     and its valid content ( L4 payload portion ) size is at most PMTU.
        #
        #
        # --- As L4 pushes egress data into the pending "original packet", it internally
        #     allocates a child skb, copies egress data into the child skb, and appends it to
        #     the skb fragment list.
        #
        #
        # --- The data of the child skb fragments is considered part of the parent skb,
        #     and is counted in the parent skb "sk_buff->len".

representing as a figure:

----------------------      -----------------------      -----------------------               -----------------------
| parent skb         |      | child skb #1        |      | child skb #2        |               | child skb #N        |
|                    |      |                     |      |                     |               |                     |
| main data space    |      | main data space     |      | main data space     |               | main data space     |
| size <= PMTU       |      | size <= PMTU        |      | size <= PMTU        |               | size <= PMTU        |
|                    | ---> |                     | ---> |                     | ---> ... ---> |                     |
| L4 header          |      |                     |      |                     |               |                     |
| L4 payload portion |      | L4 payload portion  |      | L4 payload portion  |               | L4 payload portion  |
|                    |      |                     |      |                     |               |                     |
----------------------      -----------------------      -----------------------               -----------------------

Then, the L3 fragmentation function checks whether an egress "original packet" skb is suitable for the fast path, mainly on the following 2 aspects:

    the skb is of the above format

    the main data space length of the parent skb and of every child skb fragment does not exceed PMTU.

If OK, then the fast path is performed this way:

    Construct an L3 header for each skb ( parent skb, and every child skb fragment ).
    Since they ( parent skb, and every child skb fragment ) have L3 headers now, they become "fragment packet"s.

    Unlink all child skb fragments from their parent.

    #
    # After the child skb fragments get unlinked, they are independent of their parent, and their data
    # is no longer considered part of the parent skb, so it is no longer counted in the parent skb "sk_buff->len".
    #
#

a figure is like this:

now become:

L3 "fragment packet #0"     "fragment packet #1"         "fragment packet #2"                  "fragment packet #N"

----------------------      -----------------------      -----------------------               -----------------------
| old parent skb     |      | old child skb #1    |      | old child skb #2    |               | old child skb #N    |
|                    |      |                     |      |                     |               |                     |
| main data space    |      | main data space     |      | main data space     |               | main data space     |
| size <= PMTU       |      | size <= PMTU        |      | size <= PMTU        |               | size <= PMTU        |
|                    | -X-> |                     | -X-> |                     | -X-> ... -X-> |                     |
| L3 header          |  /\  |                     |      |                     |               |                     |
| L4 header          |  |   | L3 header           |      | L3 header           |               | L3 header           |
| L4 payload portion |  |   | L4 payload portion  |      | L4 payload portion  |               | L4 payload portion  |
|                    |  |   |                     |      |                     |               |                     |
----------------------  |   -----------------------      -----------------------               -----------------------
                        |
                        |---- X means "unlink"

--- We see, there is no new skb allocation and no memory copy in this logic, so it is fast, hence the name "fast path".
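
--- sketch: the fast path check and the unlink step, modeled on a toy structure. Everything here is an assumption made for illustration: "struct toy_skb", "fast_path_ok()" and "fast_path_unlink()" are hypothetical names, not kernel code, and the real ip_fragment() / ip6_fragment() perform further checks that this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct sk_buff; field names are illustrative only. */
struct toy_skb {
    size_t linear_len;         /* bytes in the main data space */
    size_t nr_page_frags;      /* models "skb_shared_info->frags" */
    struct toy_skb *frag_list; /* models "skb_shared_info->frag_list" (parent only) */
    struct toy_skb *next;      /* link between sibling child skbs */
};

/*
 * Fast path is possible when the skb has the expected layout and every
 * piece still fits in pmtu once an L3 header of l3_hlen bytes is added.
 */
bool fast_path_ok(const struct toy_skb *parent, size_t pmtu, size_t l3_hlen)
{
    assert(parent != NULL);
    if (parent->frag_list == NULL || parent->nr_page_frags != 0)
        return false;
    if (parent->linear_len + l3_hlen > pmtu)
        return false;
    for (const struct toy_skb *c = parent->frag_list; c != NULL; c = c->next) {
        if (c->nr_page_frags != 0 || c->linear_len + l3_hlen > pmtu)
            return false;
    }
    return true;
}

/*
 * Unlink the children from the parent: no allocation, no copy.
 * Stores the resulting independent "fragment packet"s in out[] and returns
 * their count. The caller must provide room for the parent plus all children.
 */
size_t fast_path_unlink(struct toy_skb *parent, struct toy_skb **out, size_t max_out)
{
    size_t n = 0;
    struct toy_skb *c = parent->frag_list;

    parent->frag_list = NULL;
    if (n < max_out)
        out[n++] = parent;
    while (c != NULL && n < max_out) {
        struct toy_skb *next = c->next;
        c->next = NULL; /* the child is now independent */
        out[n++] = c;
        c = next;
    }
    return n;
}
```

Note that the sketch only walks and cuts pointers, which is the whole point of the fast path: the data stays where L4 put it.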

===============================================================================================================

@@-@-@ slow path

If the L3 fragmentation function finds that an egress "original packet" skb is _NOT_ suitable for the fast path, for example:

    not of the fast path format # eg. a big linear skb.

    or, of that format, but the length check against PMTU does not pass

then the slow path is used for fragmentation. Simply:

    In a loop,

        allocate a new "sk_buff" of at most PMTU length.

        copy data content (L4 header + L4 payload) from the "original packet" skb into the new "sk_buff".

        construct an L3 header for each new "sk_buff", making it a "fragment packet"

    until the data content of the "original packet" skb has been copied completely.

--- We see, there are new skb allocations and memory copies in this logic, so it is slower than the "fast path", hence the name "slow path".
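
--- sketch: the slow path loop as a standalone copy loop. This is a simplified model, not the kernel routine: "struct toy_frag" and "slow_path_fragment()" are hypothetical, the output fragments are pre-allocated by the caller instead of being allocated per iteration, and the L3 header is represented only by the offset and more-fragments fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_FRAG_MAX 1500

/* Toy "fragment packet": the header is reduced to the two fragment fields. */
struct toy_frag {
    uint16_t offset_units; /* fragment offset in 8-octet units */
    int      more;         /* 1 if more fragments follow, 0 for the last one */
    size_t   len;          /* payload bytes carried by this fragment */
    uint8_t  data[TOY_FRAG_MAX];
};

/*
 * Split payload into fragments that fit mtu once l3_hlen header bytes are
 * added. Every fragment except the last carries a multiple of 8 bytes.
 * Returns the number of fragments, or 0 if the input cannot be handled.
 */
size_t slow_path_fragment(const uint8_t *payload, size_t payload_len,
                          size_t mtu, size_t l3_hlen,
                          struct toy_frag *out, size_t max_out)
{
    size_t chunk, done = 0, n = 0;

    assert(out != NULL);
    if (mtu <= l3_hlen)
        return 0;
    chunk = (mtu - l3_hlen) & ~(size_t)7; /* round down to a multiple of 8 */
    if (chunk == 0 || chunk > TOY_FRAG_MAX)
        return 0;

    while (done < payload_len && n < max_out) {
        size_t len = payload_len - done;

        if (len > chunk)
            len = chunk;
        memcpy(out[n].data, payload + done, len); /* the copy that makes it "slow" */
        out[n].len = len;
        out[n].offset_units = (uint16_t)(done / 8);
        done += len;
        out[n].more = (done < payload_len);
        n++;
    }
    return done == payload_len ? n : 0;
}
```

With a 3000 byte payload, an MTU of 1500 and a 20 byte L3 header, this produces fragments of 1480, 1480 and 40 bytes at offsets 0, 185 and 370 (in 8-octet units).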

===============================================================================================================

@@-@-@ an ipv4 router might further frag a forwarded ipv4 fragment, but an ipv6 router will not.

Normally, we think of fragmentation as happening at the source node.

But an ipv4 router can further fragment an ipv4 fragment that it is forwarding.

For example.

    ---> "fragment packet" --->

| source node | ---> | router #a | ---> | router #b | ---> | router #c | ---> | destination node |

At first, the "fragment packet" size is less than PMTU, so it can pass through.

Then suddenly, the MTU ( note, not PMTU ) of the link from router #b to router #c decreases below the "fragment packet"
size. Such as:

    CLI "interface router_#b_interface_to_router_#c mtu from 1500 to 1000"

    or the routing protocol on router #b selects another router #d as its next hop

In this case, the following things would happen:

#1. If the 'DF' ( don't fragment ) bit in the IPv4 header of this "fragment packet" is SET, then further
    fragmentation by an intermediate router is not allowed.

    In this case, the router sends back an ICMP error message ( Destination Unreachable / Fragmentation Needed, the ipv4 "TooBig" ) to the source node, for PMTU logic.

#2. Otherwise, it depends on router configuration.

    If the router is set NOT to fragment further, then like #1, it sends back the ICMP error message.

    If the router is set to fragment further, then:

        the IPv4 fragmentation function is called on this "fragment packet", # by router.
        splitting it into 2 smaller "fragment packets". # and normally, fragmentation "slow path"

---------------------------------------------------------------------------------------------------------------

--- ipv4 code:

"dst_entry->input()" = ip_forward(skb) # ip4 forward path

#
# If the forwarded ip4 packet ( might be a large fragment ) exceeds the next hop MTU, and:
#
# DF "Don't Fragment" bit is set in the ip4 header # explicitly disallows fragmentation
#
# then send back an icmp4 ( ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED ) message to the source node.
#
#
# Otherwise, if the DF bit is not set, the ip4 forward path continues to ip_finish_output(),
# where it performs further fragmentation of the large fragment.
#

if (unlikely(skb->len > mtu && !skb_is_gso(skb) &&
             (ip_hdr(skb)->frag_off & htons(IP_DF))) && !skb->local_df) {

        IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);

        icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                  htonl(mtu));
        goto drop;
}

ip_forward_finish()

dst_output(skb); = "dst_entry->output()" = ip_output()

ip_finish_output()

if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
        return ip_fragment(skb, ip_finish_output2); # further fragmentation happens here.
else
        return ip_finish_output2(skb);

---------------------------------------------------------------------------------------------------------------

Under the same case, an ipv6 router is never allowed to further fragment a large ipv6 "fragment packet".

This is specified by RFC:

<<rfc2460>> / 4.5 Fragment Header

(Note: unlike IPv4, fragmentation in IPv6 is performed only by source nodes, not by
routers along a packet's delivery path -- see section 5.)

[tip] Well, implementation does not always match specification: exceptions exist, as found when investigating netfilter and ipsec.

see <<L3---fragment---netfilter.txt>>

---------------------------------------------------------------------------------------------------------------

--- ipv6 code:

"dst_entry->input()" = ip6_forward() # ip6 forward path

#
# These are the same check conditions as in ip6_output() for calling ip6_fragment().
#
#
# Because for ip6, fragmentation is only performed by the source node, and intermediate routers
# should not perform further fragmentation of a large fragment.
#
#
# So here, if "skb->len" > mtu, we simply send back an icmp6 "Packet Too Big Message" and return,
# with no chance to reach the ip6_fragment() frag logic in ip6_finish_output()
#

if ((!skb->local_df && skb->len > mtu && !skb_is_gso(skb)) ||
    (IP6CB(skb)->frag_max_size && IP6CB(skb)->frag_max_size > mtu)) {

        /* Again, force OUTPUT device used as source address */
        skb->dev = dst->dev;
        icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
        IP6_INC_STATS_BH(net, ip6_dst_idev(dst),
                         IPSTATS_MIB_INTOOBIGERRORS);
        IP6_INC_STATS_BH(net, ip6_dst_idev(dst),
                         IPSTATS_MIB_FRAGFAILS);
        kfree_skb(skb);
        return -EMSGSIZE;

}

ip6_forward_finish()

dst_output(skb); = "dst_entry->output()" = ip6_output(skb)

ip6_finish_output(skb)

#
# frag logic happen here.
#

if ((skb->len > ip6_skb_dst_mtu(skb) && !skb_is_gso(skb)) ||
    dst_allfrag(skb_dst(skb)) ||
    (IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size))
        return ip6_fragment(skb, ip6_finish_output2);

---------------------------------------------------------------------------------------------------------------
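
--- sketch: the forwarding decision for an oversized packet, ipv4 versus ipv6, reduced to its core. The enum and both function names are hypothetical. The sketch deliberately ignores the GSO, "local_df" and "frag_max_size" conditions seen in the kernel excerpts above, as well as any router configuration knob.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fwd_action {
    FWD_SEND,      /* packet fits, forward as is */
    FWD_FRAGMENT,  /* router fragments further (ipv4 only) */
    FWD_ICMP_ERROR /* drop and report back to the source node */
};

/* ipv4: an oversized packet is fragmented further unless DF is set. */
enum fwd_action ipv4_forward_decide(size_t pkt_len, size_t mtu, bool df_set)
{
    assert(mtu > 0);
    if (pkt_len <= mtu)
        return FWD_SEND;
    return df_set ? FWD_ICMP_ERROR : FWD_FRAGMENT;
}

/* ipv6: routers never fragment; an oversized packet always yields an error. */
enum fwd_action ipv6_forward_decide(size_t pkt_len, size_t mtu)
{
    assert(mtu > 0);
    return pkt_len <= mtu ? FWD_SEND : FWD_ICMP_ERROR;
}
```

In the ipv4 case the error is ICMP_DEST_UNREACH / ICMP_FRAG_NEEDED; in the ipv6 case it is ICMPV6_PKT_TOOBIG, as in the excerpts above.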

===============================================================================================================

@@-@-@ retransmission # always an "original packet", so for all "fragment packet"s of an "original packet", not for a single "fragment packet"

Packet loss is common in networks.

If some "fragment packet" gets lost on the network path, the "original packet" can not be defragmented at the destination node.

Retransmission is performed by the L4 layer of the source node.

The L4 layer (like TCP) considers the whole "original packet" lost, and retransmits it. # L4 doesn't know which "fragment packet" was lost.

The retransmitted "original packet" goes through the fragmentation logic, AGAIN.

===============================================================================================================

@@-@ defrag

ipv4: int ip_defrag(struct sk_buff *skb, u32 user)

ipv6: "Fragment Header" extension header handler - "frag_protocol" "->handler()" = ipv6_frag_rcv()

===============================================================================================================

@@-@-@ defrag logic # from local delivery path.

#
# local delivery path:
#
#   packet destination address is a unicast L3 address configured on the local node, or a multicast L3 address the
#   local node is interested in.
#
#   Then the ingress routing lookup determines that this ingress packet should be passed up to the L4 layer.
#
#
# If this ingress packet is a "fragment packet", then it goes into the defrag logic first.
#
# There are 2 possible results from the defrag logic:
#
#   #1. Not all "fragment packet"s of an "original packet" have been received yet, so this "fragment packet"
#       is enqueued into an internal queue, waiting for the other "fragment packet"s to come in.
#
#       In this case, this "fragment packet" is considered taken by the defrag logic,
#       and its handling stops for now.
#
#
#   #2. This ingress "fragment packet" makes the pending internal queue complete.
#
#       --- Note that "fragment packet"s might arrive out of order, so it might not be the
#           last "fragment packet", but could be any intermediate or the 1st "fragment packet".
#
#       In this case, the reassembly operation is performed.
#
#       [*] And the reassembly operation plays a magic:
#
#           it makes the ingress "fragment packet" skb become the "original packet".
#
#       Then, continue to pass the reassembled "original packet" up to the L4 RX handler.
#
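
--- sketch: how the two results of the defrag logic can be told apart even when fragments arrive out of order. This is a toy model under assumptions: "struct toy_frag_queue" and "toy_defrag()" are hypothetical, the queue only keeps counters instead of the fragments themselves, and overlapping or duplicate fragments are assumed not to occur.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy per-"original packet" queue state; it only counts bytes. */
struct toy_frag_queue {
    size_t total_len; /* end of the last fragment, valid once last_in is set */
    size_t received;  /* bytes received so far */
    bool   first_in;  /* fragment with offset 0 has arrived */
    bool   last_in;   /* fragment with M flag 0 has arrived */
};

enum defrag_result {
    DEFRAG_QUEUED,  /* result #1: taken by defrag, still waiting */
    DEFRAG_COMPLETE /* result #2: this fragment completes the queue */
};

/* offset and len are in bytes; more is the M (more fragments) flag. */
enum defrag_result toy_defrag(struct toy_frag_queue *q,
                              size_t offset, size_t len, bool more)
{
    assert(q != NULL);
    if (offset == 0)
        q->first_in = true;
    if (!more) {
        q->last_in = true;
        q->total_len = offset + len;
    }
    q->received += len;

    if (q->first_in && q->last_in && q->received == q->total_len)
        return DEFRAG_COMPLETE;
    return DEFRAG_QUEUED;
}
```

If the last fragment arrives first, then the first, then the middle one, only the third call returns DEFRAG_COMPLETE, which matches the note that the completing fragment can be any of them.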

---------------------------------------------------------------------------------------------------------------

--- ipv4 case:

netif_receive_skb() # L2->L3 RX interface

"ip_packet_type->func" = ip_rcv() -> ip_rcv_finish()

#
# ingress routing lookup.
#
# then "skb_dst(skb)->input" would be set to:
#
# #1. ip_local_deliver() # if packet dst_addr is UC address configured on local node.
#
# #2. ip_mr_input() # if packet dst_addr is MC address, then it would be:
# # #a. if multicast routing, then forwarded by ip_mr_forward() with a skb clone.
# # #b. then, if interested by local node, call to ip_local_deliver().
#

ip_route_input_noref(skb, iph->daddr, iph->saddr,
iph->tos, skb->dev);

dst_input(skb) = ip_local_deliver()

#
# defrag logic.
#

if (ip_is_fragment(ip_hdr(skb))) {
if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
return 0;
}

return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
ip_local_deliver_finish);


---------------------------------------------------------------------------------------------------------------

--- ipv6 case:

netif_receive_skb() # L2->L3 RX interface

"ipv6_packet_type->func()" = ipv6_rcv() -> ip6_rcv_finish()

#
# ingress routing lookup.
#
# then "skb_dst(skb)->input" would be set to:
#
# #1. ip6_input() # if packet dst_addr is UC address configured on local node.
#
# #2. ip6_mc_input() # if packet dst_addr is MC address, then it would be:
# # #a. if multicast routing, then forwarded by ip6_mr_forward() with a skb clone.
# # #b. then, if interested by local node, call to ip6_input().

if (!skb_dst(skb))
ip6_route_input(skb);

/*
* Deliver IP Packets to the higher protocol layers.
*/

dst_input(skb) = ip6_input() -> ip6_input_finish()

resubmit:
nhoff = IP6CB(skb)->nhoff;
nexthdr = skb_network_header(skb)[nhoff];

if ((ipprot = rcu_dereference(inet6_protos[nexthdr])) != NULL) {

ret = ipprot->handler(skb);
if (ret > 0)
goto resubmit;

------------------------------------------------------

#
# defrag logic:
#
# "Fragment Header" extension header handler - "frag_protocol" "->handler()" = ipv6_frag_rcv()
#

------------------------------------------------------
}
else {
#
# release the ingress packet.
#
kfree_skb(skb); / consume_skb(skb);
}

===============================================================================================================

@@-@-@ reassembly magic # new ingress "fragment packet", _becomes_ "original packet"

function:

ipv4: ip_frag_reasm()

ipv6: ip6_frag_reasm()

All pending "fragment packet"s of an "original packet" are organized into an internal queue structure.

struct inet_frag_queue {

#
# Linked by "sk_buff->next" field.
#

struct sk_buff *fragments; /* list of received fragments */
}

A figure of pending "fragment packet" is like:

---------------------- ----------------------- ----------------------- -----------------------
| "fragment packet" | | "fragment packet" | | "fragment packet" | | "fragment packet" |
| #0 | ---> | #1 | ---> | #3 | ---> ... ---> | #N |
| | | | | | | |
---------------------- ----------------------- ----------------------- -----------------------

If a new ingress "fragment packet" makes the pending "original packet" queue complete, then reassembly plays its magic to
get the "original packet".

------------------------------------------------------

#1. Intentionally, do _NOT_ link the new ingress "fragment packet" into the "inet_frag_queue->fragments" list.

----------------------
| "fragment packet" | # call it as "fragment_packet_new_in"
| #2 |
| |
----------------------

---------------------- ----------------------- ----------------------- -----------------------
| "fragment packet" | | "fragment packet" | | "fragment packet" | | "fragment packet" |
| #0 | ---> | #1 | ---> | #3 | ---> ... ---> | #N |
| | | | | | | |
---------------------- ----------------------- ----------------------- -----------------------

------------------------------------------------------

#2. create a skb clone of "fragment_packet_new_in"

    ----------------------                          ----------------------
    | "fragment packet"  |      skb_clone()         | clone of           |
    |        #2          |  create a skb clone      | "fragment packet"  |   # call it as "fragment_packet_new_in_CLONE"
    |                    |                          |        #2          |
    ----------------------                          ----------------------
              |                                               |
              |                                               |
              ------ share ------------------ share ----------
                                    |
                                    |
                                    \/
                       -----------------------------
                       |         data area         |
                       |                           |
                       |        content is         |
                       |    L4 payload portion     |
                       -----------------------------

------------------------------------------------------

#3. Link "fragment_packet_new_in_CLONE" ( not "fragment_packet_new_in" ) into "inet_frag_queue->fragments"

#
# In fact, this is
# "fragment_packet_new_in_CLONE"
#

---------------------- ----------------------- ----------------------- ----------------------- -----------------------
| "fragment packet" | | "fragment packet" | | "fragment packet" | | "fragment packet" | | "fragment packet" |
| #0 | ---> | #1 | ---> | #2 | ---> | #3 | ---> ... ---> | #N |
| | | | | | | | | |
---------------------- ----------------------- ----------------------- ----------------------- -----------------------


------------------------------------------------------

#4. Make "fragment_packet_new_in" _discard_ its data area, so "fragment_packet_new_in_CLONE" now has
exclusive ownership of the data area.

    ----------------------------                ----------------------------------
    | "fragment_packet_new_in" |                | "fragment_packet_new_in_CLONE" |
    |                          |                |                                |
    |                          |                |                                |
    ----------------------------                ----------------------------------
                 |                                              |
                 |                                              |
                 ---- _discard_ ------- exclusive ownership ----
                                    |
                                    |
                                    \/
                       -----------------------------
                       |         data area         |
                       |                           |
                       |        content is         |
                       |    L4 payload portion     |
                       -----------------------------

In the meantime, "fragment_packet_new_in" _hijacks_ the data area of "fragment packet #0",
and links the rest of the "fragment packet"s in "inet_frag_queue->fragments" into its "skb_shared_info->frag_list".

skb_morph(head, fq->q.fragments);

consume_skb(fq->q.fragments);

[*] This is like constructing a format similar to the fragmentation "fast path"

"fragment_packet_new_in"

|
| hijack data area
|

---------------------- ----------------------- ----------------------- ----------------------- -----------------------
| "fragment packet" | | "fragment packet" | | "fragment packet" | | "fragment packet" | | "fragment packet" |
| #0 | --- "skb_shared_info->frag_list" ---> | #1 | ---> | #2 | ---> | #3 | ---> ... ---> | #N |
| | | | | | | | | |
---------------------- ----------------------- ----------------------- ----------------------- -----------------------


That is, the new ingress "skb", previously a "fragment packet", now becomes the "original packet". # so this is the magic.

[*] But the "skb" pointer value is _NOT_ changed.
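
--- sketch: steps #2 to #4 replayed on toy structures, to show that the caller's pointer stays the same while what it points to becomes the "original packet". All names are hypothetical. The clone is caller-provided storage standing in for the skb_clone() result, the data area is reduced to a user counter, and the case where the new fragment is itself fragment #0 is not covered.

```c
#include <assert.h>
#include <stddef.h>

/* Shared data area, reduced to a count of skbs referring to it. */
struct toy_data {
    int users;
};

struct toy_skb {
    struct toy_data *data;
    struct toy_skb  *next;      /* link inside the pending queue */
    struct toy_skb  *frag_list; /* models "skb_shared_info->frag_list" */
};

/*
 * new_in: the fragment that completes the queue (not fragment #0).
 * clone:  storage standing in for the result of skb_clone(new_in).
 * head:   fragment #0, first entry of the pending queue.
 * prev:   the queued fragment that immediately precedes new_in.
 */
void toy_reasm(struct toy_skb *new_in, struct toy_skb *clone,
               struct toy_skb *head, struct toy_skb *prev)
{
    assert(prev != NULL);

    /* #2: the clone shares new_in's data area. */
    clone->data = new_in->data;
    clone->data->users++;
    clone->frag_list = NULL;

    /* #3: link the clone, not new_in, into the queue after prev. */
    clone->next = prev->next;
    prev->next = clone;

    /* #4: new_in drops its own data area and takes over that of fragment #0. */
    new_in->data->users--;
    new_in->data = head->data;
    new_in->frag_list = head->next;
    new_in->next = NULL;

    /* The old fragment #0 structure no longer owns anything. */
    head->data = NULL;
    head->next = NULL;
}
```

After the call, "new_in" is still the same pointer, but it now carries the data of fragment #0 and a frag_list of #1, #2 (the clone), #3 and so on.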

------------------------------------------------------

#5. Since "skb" has now become the "original packet", pass it up to the L4 layer.

ipv4: ip_local_deliver(skb)

ip_frag_reasm(skb) # "skb" become the "original packet" internally

#
# Upload the "original packet" to L4 layers.
#

ip_local_deliver_finish(skb)


ipv6: ip6_input_finish(skb)

resubmit:

"inet6_protocol->handler(skb)"

#
# "Fragment Header" extension header handler - "frag_protocol->handler()" = ipv6_frag_rcv( skb) # "skb" becomes the "original packet" internally
#
# goto next iteration of the "resubmit" label.
#
# then pass up to the final L4 protocol handler "inet6_protocol->handler()".
#

[*] It is OK to pass a reassembled "original packet" with child skb fragments up to the L4 layer, because all later
operations on this "original packet" happen in the local node.

Commonly, the data content of the reassembled "original packet" skb is copied to the user-level "iovec" buffer provided
by the socket recvmsg() system call, by:

skb_copy_datagram_iovec()

which carefully handles skb data content in its "main data space", "children page fragments", and "children skb fragments".
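
--- sketch: copying a reassembled packet that still has child skb fragments into one flat buffer. This only illustrates the walking order and is not the kernel routine: "struct toy_rx_skb" and "toy_copy_datagram()" are hypothetical, "len" here means the bytes of one skb's own main data space only, and page fragments are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy reassembled skb: a head plus a chain of child skbs. */
struct toy_rx_skb {
    const uint8_t *data;          /* main data space */
    size_t len;                   /* bytes in this skb's main data space only */
    struct toy_rx_skb *frag_list; /* first child (head only) */
    struct toy_rx_skb *next;      /* next sibling child */
};

/*
 * Copy the head's data, then each child's data in order, into dst.
 * Stops when dst is full. Returns the number of bytes copied.
 */
size_t toy_copy_datagram(const struct toy_rx_skb *skb, uint8_t *dst, size_t dst_len)
{
    size_t copied, n;

    assert(skb != NULL && dst != NULL);
    n = skb->len < dst_len ? skb->len : dst_len;
    memcpy(dst, skb->data, n);
    copied = n;

    for (const struct toy_rx_skb *c = skb->frag_list;
         c != NULL && copied < dst_len; c = c->next) {
        n = c->len;
        if (n > dst_len - copied)
            n = dst_len - copied;
        memcpy(dst + copied, c->data, n);
        copied += n;
    }
    return copied;
}
```

The receiver never needs the fragments merged into one linear skb; walking the chain at copy time is enough.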

===============================================================================================================

@@ netfilter: fragment handling # <symlink> to: <<L3---fragment---netfilter.txt>>

===============================================================================================================

@@ ipsec: fragment handling # <symlink> to: <<ipsec---fragment.txt>>


===============================================================================================================

@@ end
