L3---fragment: Layer 3 fragmentation / reassembly mechanism

Time: 2022-08-02 17:58:35
#
# This doc describes:
#
#       _regular_ L3 (ipv4, ipv6) fragment handling # including: "fragmentation" - frag, "defragmentation" - defrag
#
#       And the _special_ handling of L3 fragments by various networking modules. # like netfilter, ipsec
#




===============================================================================================================


--- index:


---------------------------------------------------------------------------------------------------------------


=> terms: "original packet", "fragment packet", "fragmentation" - frag, "defragmentation" - defrag


    => <quote> from ipv6 rfc <<rfc2460---Internet.Protocol.Version.6.(IPv6).Specification.txt>> / 4.5  Fragment Header


---------------------------------------------------------------------------------------------------------------


=> regular L3 fragment handling


    => frag # fast path / slow path


        => fast path


        => slow path


        => ipv4 router might further frag a forwarded ipv4 fragment, but ipv6 router must not.


        => retransmission   # always a "original packet", so for all "fragment packet" of a "original packet", not for a single "fragment packet"


    ------------------------------------------------------


    => defrag


        => defrag logic # from local delivery path.


        => reassembly magic  # new ingress "fragment packet", _become_ "original packet"


---------------------------------------------------------------------------------------------------------------


=> netfilter: fragment handling # <symlink> to: <<L3---fragment---netfilter.txt>>


---------------------------------------------------------------------------------------------------------------


=> ipsec: fragment handling # <symlink> to: <<ipsec---fragment.txt>>


---------------------------------------------------------------------------------------------------------------


===============================================================================================================


@@ terms: "original packet", "fragment packet", "fragmentation" - frag, "defragmentation" - defrag


#
# These terms are defined by the IPv6 RFC.
#
#
# The same definitions and rules also apply to IPv4     # --- the IPv6 RFC description is just clearer.
#




===============================================================================================================


@@-@ <quote> from ipv6 rfc <<rfc2460---Internet.Protocol.Version.6.(IPv6).Specification.txt>> / 4.5  Fragment Header






   The initial, large, unfragmented packet is referred to as the
   "original packet", and it is considered to consist of two parts, as
   illustrated:


   original packet:


   +------------------+----------------------//-----------------------+
   |  Unfragmentable  |                 Fragmentable                  |
   |       Part       |                     Part                      |
   +------------------+----------------------//-----------------------+


      The Unfragmentable Part consists of the IPv6 header plus any
      extension headers that must be processed by nodes en route to the
      destination, that is, all headers up to and including the Routing
      header if present, else the Hop-by-Hop Options header if present,
      else no extension headers.


      The Fragmentable Part consists of the rest of the packet, that is,
      any extension headers that need be processed only by the final
      destination node(s), plus the upper-layer header and data.


   The Fragmentable Part of the original packet is divided into
   fragments, each, except possibly the last ("rightmost") one, being an
   integer multiple of 8 octets long.  The fragments are transmitted in
   separate "fragment packets" as illustrated:


   original packet:


   +------------------+--------------+--------------+--//--+----------+
   |  Unfragmentable  |    first     |    second    |      |   last   |
   |       Part       |   fragment   |   fragment   | .... | fragment |
   +------------------+--------------+--------------+--//--+----------+




   fragment packets:


   +------------------+--------+--------------+
   |  Unfragmentable  |Fragment|    first     |
   |       Part       | Header |   fragment   |
   +------------------+--------+--------------+


   +------------------+--------+--------------+
   |  Unfragmentable  |Fragment|    second    |
   |       Part       | Header |   fragment   |
   +------------------+--------+--------------+
                         o
                         o
                         o
   +------------------+--------+----------+
   |  Unfragmentable  |Fragment|   last   |
   |       Part       | Header | fragment |
   +------------------+--------+----------+


   Each fragment packet is composed of:


      (1) The Unfragmentable Part of the original packet, with the
          Payload Length of the original IPv6 header changed to contain
          the length of this fragment packet only (excluding the length
          of the IPv6 header itself), and the Next Header field of the
          last header of the Unfragmentable Part changed to 44.


      (2) A Fragment header containing:


               The Next Header value that identifies the first header of
               the Fragmentable Part of the original packet.


               A Fragment Offset containing the offset of the fragment,
               in 8-octet units, relative to the start of the
               Fragmentable Part of the original packet.  The Fragment
               Offset of the first ("leftmost") fragment is 0.


               An M flag value of 0 if the fragment is the last
               ("rightmost") one, else an M flag value of 1.


               The Identification value generated for the original
               packet.


      (3) The fragment itself.


   The lengths of the fragments must be chosen such that the resulting
   fragment packets fit within the MTU of the path to the packets'
   destination(s).




   At the destination, fragment packets are reassembled into their
   original, unfragmented form, as illustrated:


   reassembled original packet:


   +------------------+----------------------//------------------------+
   |  Unfragmentable  |                 Fragmentable                   |
   |       Part       |                     Part                       |
   +------------------+----------------------//------------------------+


   The following rules govern reassembly:


      An original packet is reassembled only from fragment packets that
      have the same Source Address, Destination Address, and Fragment
      Identification.


      The Unfragmentable Part of the reassembled packet consists of all
      headers up to, but not including, the Fragment header of the first
      fragment packet (that is, the packet whose Fragment Offset is
      zero), with the following two changes:


         The Next Header field of the last header of the Unfragmentable
         Part is obtained from the Next Header field of the first
         fragment's Fragment header.


         The Payload Length of the reassembled packet is computed from
         the length of the Unfragmentable Part and the length and offset
         of the last fragment.  For example, a formula for computing the
         Payload Length of the reassembled original packet is:


           PL.orig = PL.first - FL.first - 8 + (8 * FO.last) + FL.last


           where
           PL.orig  = Payload Length field of reassembled packet.
           PL.first = Payload Length field of first fragment packet.
           FL.first = length of fragment following Fragment header of
                      first fragment packet.
           FO.last  = Fragment Offset field of Fragment header of
                      last fragment packet.
           FL.last  = length of fragment following Fragment header of
                      last fragment packet.


      The Fragmentable Part of the reassembled packet is constructed
      from the fragments following the Fragment headers in each of the
      fragment packets.  The length of each fragment is computed by
      subtracting from the packet's Payload Length the length of the
      headers between the IPv6 header and fragment itself; its relative
      position in Fragmentable Part is computed from its Fragment Offset
      value.


      The Fragment header is not present in the final, reassembled
      packet.


   The following error conditions may arise when reassembling fragmented
   packets:


      If insufficient fragments are received to complete reassembly of a
      packet within 60 seconds of the reception of the first-arriving
      fragment of that packet, reassembly of that packet must be
      abandoned and all the fragments that have been received for that
      packet must be discarded.  If the first fragment (i.e., the one
      with a Fragment Offset of zero) has been received, an ICMP Time
      Exceeded -- Fragment Reassembly Time Exceeded message should be
      sent to the source of that fragment.


      If the length of a fragment, as derived from the fragment packet's
      Payload Length field, is not a multiple of 8 octets and the M flag
      of that fragment is 1, then that fragment must be discarded and an
      ICMP Parameter Problem, Code 0, message should be sent to the
      source of the fragment, pointing to the Payload Length field of
      the fragment packet.


      If the length and offset of a fragment are such that the Payload
      Length of the packet reassembled from that fragment would exceed
      65,535 octets, then that fragment must be discarded and an ICMP
      Parameter Problem, Code 0, message should be sent to the source of
      the fragment, pointing to the Fragment Offset field of the
      fragment packet.


   The following conditions are not expected to occur, but are not
   considered errors if they do:


      The number and content of the headers preceding the Fragment
      header of different fragments of the same original packet may
      differ.  Whatever headers are present, preceding the Fragment
      header in each fragment packet, are processed when the packets
      arrive, prior to queueing the fragments for reassembly.  Only
      those headers in the Offset zero fragment packet are retained in
      the reassembled packet.


      The Next Header values in the Fragment headers of different
      fragments of the same original packet may differ.  Only the value
      from the Offset zero fragment packet is used for reassembly.
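

--- (not part of the RFC quote) The Payload Length formula above can be checked with a small worked example. The sketch below is only an illustration: the struct and function names are made up here, and the numbers assume an original packet with no extension headers, a 3000-octet Fragmentable Part and a 1500-octet MTU (so the first fragment carries 1448 octets, and the last one carries 104 octets at offset 362).

```c
#include <stdint.h>

/* Inputs taken from the first and last fragment packets (lengths in octets). */
struct frag_lengths {
    uint32_t pl_first; /* Payload Length field of the first fragment packet */
    uint32_t fl_first; /* fragment length after the Fragment header, first packet */
    uint32_t fo_last;  /* Fragment Offset field of the last packet, in 8-octet units */
    uint32_t fl_last;  /* fragment length after the Fragment header, last packet */
};

/* PL.orig = PL.first - FL.first - 8 + (8 * FO.last) + FL.last */
static uint32_t reassembled_payload_len(const struct frag_lengths *f)
{
    return f->pl_first - f->fl_first - 8 + 8 * f->fo_last + f->fl_last;
}
```

With PL.first = 1456 (8-octet Fragment header plus 1448 octets), FL.first = 1448, FO.last = 362 and FL.last = 104, the terms "PL.first - FL.first - 8" cancel to the length of the extension headers in the Unfragmentable Part (zero here), and the rest adds up to the 3000 octets of the Fragmentable Part.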




===============================================================================================================


@@ regular L3 fragment handling


#
# By "regular L3 fragment handling", we mean handling with no special networking module involved, such as netfilter, ipsec, etc.
#
#
# In the linux implementation, ipv4 and ipv6 fragmentation / defragmentation are quite similar.
#
#       --- For defragmentation, there is even a common $(defrag) framework.
#
#
# The following description is common to ipv4 and ipv6.     # for more detailed and specific descriptions, see the references.
#




===============================================================================================================


@@-@ frag   # fast path / slow path


function:


    ipv4:   int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))


    ipv6:   int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))




===============================================================================================================


@@-@-@ fast path


The upper L4 layer is aware that L3 may have to fragment, and tries to make the L3 fragmentation work easier.


So when L4 pushes egress data into its pending egress "original packet", it tries to organize the
egress "original packet" skb in a format that suits quick fragmentation. The format
is like this:


    sk_buff:        # its data area is like below


        | main data space |         # containing L4 header, and the first portion of L4 payload.
                                    #
                                    # The valid content size of the main data space is up to PMTU ( kept in socket )




        | children page fragments | = none  # "skb_shared_info->nr_frags" = 0, no page fragments




        | children skb fragments | -> child skb #1 -> child skb #2 -> ... -> child skb #N


                                #
                                # "skb_shared_info->frag_list" != NULL 
                                #
                                # Each child skb fragment contains a further portion of L4 payload.
                                #
                                #   --- normally, each child skb fragment is a linear skb, with only main data space,
                                #       and its valid content ( L4 payload portion ) size is up to PMTU.
                                #
                                #
                                #   --- As L4 pushes egress data into the pending "original packet", it internally
                                #       allocates a child skb, copies egress data into it, and appends it to
                                #       the skb fragment list.
                                #
                                #
                                #   --- The data of child skb fragments is considered part of the parent skb,
                                #       and is accounted in the parent skb "sk_buff->len".




 Represented as a figure:
                                                                                                                        
    ----------------------      -----------------------      -----------------------               ----------------------- 
    | parent skb         |      | child skb #1        |      | child skb #2        |               | child skb #N        | 
    |                    |      |                     |      |                     |               |                     |
    | main data space    |      | main data space     |      | main data space     |               | main data space     |
    | size =PMTU         |      | size =PMTU          |      | size =PMTU          |               | size =PMTU          |
    |                    | ---> |                     | ---> |                     | ---> ... ---> |                     |
    | L4 header          |      |                     |      |                     |               |                     |
    | L4 payload portion |      | L4 payload portion  |      | L4 payload portion  |               | L4 payload portion  | 
    |                    |      |                     |      |                     |               |                     |
    ----------------------      -----------------------      -----------------------               ----------------------- 




 Then, the L3 fragmentation function checks whether an egress "original packet" skb is suitable for fast path, mainly on the following 2 aspects:


        it is of the above format


        the main data space length of the parent skb and of every child skb fragment fits within PMTU ( leaving room for the L3 header ).
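

--- The check above can be sketched in plain C. This is a user-space model, not kernel code: "struct frag_buf" and "fast_path_ok()" are hypothetical names standing in for the skb chain and the geometry check. The 8-byte alignment and headroom conditions go beyond the 2 aspects listed above; they are assumptions that follow from fragment offsets being counted in 8-octet units and from the need to prepend an L3 header without copying.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one child skb in the frag_list chain. */
struct frag_buf {
    size_t len;             /* bytes of L4 data carried by this buffer */
    size_t headroom;        /* free bytes in front of the data */
    struct frag_buf *next;  /* next child, NULL for the last one */
};

/*
 * Return true when the parent and every child can be sent as they are.
 * max_payload is the PMTU minus the L3 header length (hlen).
 */
static bool fast_path_ok(size_t parent_len, const struct frag_buf *children,
                         size_t max_payload, size_t hlen)
{
    const struct frag_buf *f;

    if (children == NULL)
        return false;   /* not in the frag_list format */
    if (parent_len > max_payload || (parent_len & 7))
        return false;   /* first piece too big or not 8-byte aligned */

    for (f = children; f != NULL; f = f->next) {
        if (f->len > max_payload)
            return false;   /* child does not fit in one fragment packet */
        if ((f->len & 7) && f->next != NULL)
            return false;   /* only the last piece may be unaligned */
        if (f->headroom < hlen)
            return false;   /* no room to prepend the L3 header in place */
    }
    return true;
}
```

Any failed condition sends the packet to the slow path, which can always re-cut the data into pieces of the right size.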




 If OK, then fast path is performed this way:


        Construct an L3 header for each skb ( parent skb, and every child skb fragment ).
        Since they ( parent skb, and every child skb fragment ) have L3 headers now, they become "fragment packets".


        Unlink all child skb fragments from their parent.


            #
            # After child skb fragments get unlinked, they are independent of their parent: their data
            # is no longer considered part of the parent skb, and is no longer accounted in the parent skb "sk_buff->len".
            #
            #




 As a figure, the skbs now become:
        
    L3 "fragment packet #0"     "fragment packet #1"        "fragment packet #2"                    "fragment packet #N"


    ----------------------      -----------------------      -----------------------               ----------------------- 
    | old parent skb     |      | old child skb #1    |      | old child skb #2    |               | old child skb #N    | 
    |                    |      |                     |      |                     |               |                     |
    | main data space    |      | main data space     |      | main data space     |               | main data space     |
    | size =PMTU         |      | size =PMTU          |      | size =PMTU          |               | size =PMTU          |
    |                    | -X-> |                     | -X-> |                     | -X-> ... -X-> |                     |
    | L3 header          |  /\  |                     |      |                     |               |                     |
    | L4 header          |  |   | L3 header           |      | L3 header           |               | L3 header           |
    | L4 payload portion |  |   | L4 payload portion  |      | L4 payload portion  |               | L4 payload portion  | 
    |                    |  |   |                     |      |                     |               |                     |
    ----------------------  |   -----------------------      -----------------------               ----------------------- 
                            |
                            |---- X means "unlink"




--- We see that there is no new skb allocation and no memory copy in this logic, so it is fast, hence the name "fast path".
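

--- A minimal sketch of the fast path walk, under simplifying assumptions: "struct frag_buf" is a hypothetical stand-in for an skb, the parent and its children are modeled as one singly linked chain, and writing the real L3 header is reduced to filling in the fragment offset and the "more fragments" flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an skb; field names are illustrative only. */
struct frag_buf {
    size_t len;             /* L4 bytes carried by this buffer */
    uint16_t offset_units;  /* fragment offset in 8-octet units, filled in below */
    bool more_fragments;    /* MF / M flag, filled in below */
    struct frag_buf *next;  /* link to the next buffer in the chain */
};

/*
 * Turn a parent and its chained children into independent fragment packets
 * without allocating or copying: fill in offset and flag, then cut each link.
 * Returns the number of fragment packets produced.
 */
static size_t fast_path_split(struct frag_buf *parent)
{
    size_t count = 0;
    size_t offset = 0;
    struct frag_buf *cur = parent;

    while (cur != NULL) {
        struct frag_buf *next = cur->next;

        cur->offset_units = (uint16_t)(offset / 8);
        cur->more_fragments = (next != NULL);
        offset += cur->len;
        cur->next = NULL;   /* "unlink": the buffer now stands alone */
        count++;
        cur = next;
    }
    return count;
}
```

The loop touches only per-buffer metadata, which is the whole point of the fast path: the payload bytes stay where L4 put them.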




===============================================================================================================


@@-@-@ slow path


If the L3 fragmentation function finds that an egress "original packet" skb is _NOT_ suitable for fast path, like:


        not of fast path format     # eg. a big linear skb.


        or, of that format, but the length check against PMTU does not pass




then the slow path is taken to do the fragmentation, simply:


        In a loop,


            allocate a new "sk_buff" of up to PMTU length.


            copy data content (L4 header + L4 payload) from the "original packet" skb into the new "sk_buff".


            construct the L3 header for the new "sk_buff", making it a "fragment packet"


            until the data content of the "original packet" skb is completely copied.




--- We see that there are new skb allocations and memory copies in this logic, so it is slower than "fast path", hence the name "slow path".
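

--- The slow path loop can be sketched as a planning function that only computes where each copy would start and how long it would be. The names "struct frag_desc" and "plan_fragments()" are hypothetical; the allocation and the memory copy themselves are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* One planned fragment: which slice of the original payload it carries. */
struct frag_desc {
    size_t offset;  /* byte offset of this piece within the original payload */
    size_t len;     /* bytes to copy into this fragment packet */
    bool more;      /* true unless this is the last fragment */
};

/*
 * Cut payload_len bytes into pieces that fit mtu after an L3 header of hlen
 * bytes. Every piece except the last is rounded down to a multiple of 8.
 * Returns the number of pieces written to out, or 0 if out is too small or
 * no payload fits.
 */
static size_t plan_fragments(size_t payload_len, size_t mtu, size_t hlen,
                             struct frag_desc *out, size_t out_cap)
{
    size_t room, left = payload_len, offset = 0, n = 0;

    if (mtu <= hlen)
        return 0;
    room = mtu - hlen;

    while (left > 0) {
        size_t len = left > room ? room : left;

        if (len < left)
            len &= ~(size_t)7;  /* not the last piece: keep 8-byte alignment */
        if (len == 0 || n == out_cap)
            return 0;

        out[n].offset = offset;
        out[n].len = len;
        out[n].more = (len < left);
        n++;
        offset += len;
        left -= len;
    }
    return n;
}
```

For a 3000-byte payload, a 1500-byte MTU and a 20-byte IPv4 header, this plans three pieces of 1480, 1480 and 40 bytes.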




===============================================================================================================


@@-@-@ ipv4 router might further frag a forwarded ipv4 fragment, but ipv6 router must not.




Normally, we think of fragmentation as happening at the source node.




But an ipv4 router can further fragment an ipv4 fragment it is forwarding.




For example.


                            ---> "fragment packet" ---> 


        | source node | ---> | router #a | ---> | router #b | ---> | router #c | ---> | destination node |




    At first, the "fragment packet" size is no larger than PMTU, so it can pass through.


    Then suddenly, the MTU ( note, not PMTU ) of the link from router #b to router #c decreases below the "fragment packet"
    size. Such as:


                CLI "interface   router_#b_interface_to_router_#c  mtu   from  1500   to 1000"


                or the routing protocol on router #b selects another router #d as its next hop






In this case, the following things would happen:


    #1. If the 'DF' ( don't fragment ) bit in the IPv4 header of this "fragment packet" is SET, then
        further fragmentation by an intermediate router is not allowed.


        In this case, the router sends back an ICMP error message ( Destination Unreachable / Fragmentation Needed ) to the source node, for PMTU logic.




    #2. Otherwise, it depends on router configuration.


            If the router is set NOT to do further fragmentation, then like #1, it sends back the ICMP error message.




            If the router is set to do further fragmentation, then:


                    the router calls the IPv4 fragmentation function on this "fragment packet",     # normally, fragmentation "slow path"
                    and splits it into 2 smaller "fragment packets".
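

--- When a router re-fragments a fragment, the new pieces must still describe positions in the "original packet". A sketch of that bookkeeping, with hypothetical names ( "struct ip4_frag", "refragment()" ) and only the fields that matter here:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative view of one IPv4 fragment; not the kernel's struct iphdr. */
struct ip4_frag {
    uint16_t offset_units;  /* Fragment Offset field, in 8-octet units */
    bool more;              /* MF flag */
    size_t len;             /* payload bytes after the IPv4 header */
};

/*
 * Split fragment "in" so every piece carries at most "room" payload bytes
 * (room = new MTU minus IPv4 header length). Returns the number of pieces
 * written to out, or 0 if out is too small or room is below 8 bytes.
 */
static size_t refragment(const struct ip4_frag *in, size_t room,
                         struct ip4_frag *out, size_t out_cap)
{
    size_t left = in->len, done = 0, n = 0;

    while (left > 0) {
        size_t len = left > room ? room : left;

        if (len < left)
            len &= ~(size_t)7;  /* non-final pieces stay 8-byte aligned */
        if (len == 0 || n == out_cap)
            return 0;

        /* Offsets stay relative to the original packet, not to "in". */
        out[n].offset_units = (uint16_t)(in->offset_units + done / 8);
        out[n].len = len;
        /* Only the final piece inherits the incoming MF flag. */
        out[n].more = (len < left) ? true : in->more;
        n++;
        done += len;
        left -= len;
    }
    return n;
}
```

With the MTU change from 1500 to 1000 in the example above and a 20-byte header, a 1480-byte fragment at offset 185 becomes two pieces: 976 bytes at offset 185 and 504 bytes at offset 307.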




---------------------------------------------------------------------------------------------------------------


--- ipv4 code:


    "dst_entry->input()" = ip_forward(skb)      # ip4 forward path


        #
        # If the forwarded ip4 packet ( might be a large fragment ) exceeds the next hop MTU, then only if:
        #
        #           DF "Don't Fragment" bit is set in ip4 header    # explicitly disallows fragmentation
        #
        # is the icmp4 ( ICMP_DEST_UNREACH,  ICMP_FRAG_NEEDED ) message sent back to the source node.
        #
        #
        # Otherwise, if the DF bit is not set, the ip4 forward path continues on to ip_finish_output(),
        # where the large fragment is further fragmented.
        #


        if (unlikely(skb->len > mtu && !skb_is_gso(skb) &&
                 (ip_hdr(skb)->frag_off & htons(IP_DF))) && !skb->local_df) {


            IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);


            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                  htonl(mtu));
            goto drop;
        }




        ip_forward_finish()


            dst_output(skb); = "dst_entry->output()" = ip_output() 


                ip_finish_output()


                    if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
                        return ip_fragment(skb, ip_finish_output2);             # further fragmentation happens here.
                    else
                        return ip_finish_output2(skb);




---------------------------------------------------------------------------------------------------------------


In the same case, an ipv6 router is never allowed to further fragment a large ipv6 "fragment packet".


This is specified by the RFC:


    <<rfc2460>> / 4.5  Fragment Header


            (Note: unlike IPv4, fragmentation in IPv6 is performed only by source nodes, not by
           routers along a packet's delivery path -- see section 5.)




[tip] Well, implementation does not always follow the specification: exceptions show up when investigating netfilter and ipsec.


        see <<L3---fragment---netfilter.txt>>




---------------------------------------------------------------------------------------------------------------


--- ipv6 code:


    "dst_entry->input()" = ip6_forward()        # ip6 forward path


        #
        # These are the same check conditions as in ip6_output() for calling ip6_fragment().
        #
        #
        # Because for ip6, fragmentation is only performed by the source node, intermediate routers
        #       should not further fragment a large fragment.
        #
        #
        # So here, if "skb->len" > mtu, we simply send back icmp6 "Packet Too Big Message" and return,
        # with no chance to reach the ip6_fragment() frag logic in ip6_finish_output()
        #


        if ((!skb->local_df && skb->len > mtu && !skb_is_gso(skb)) ||
             (IP6CB(skb)->frag_max_size && IP6CB(skb)->frag_max_size > mtu)) {


            /* Again, force OUTPUT device used as source address */
            skb->dev = dst->dev;
            icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
            IP6_INC_STATS_BH(net, ip6_dst_idev(dst),
                     IPSTATS_MIB_INTOOBIGERRORS);
            IP6_INC_STATS_BH(net, ip6_dst_idev(dst),
                     IPSTATS_MIB_FRAGFAILS);
            kfree_skb(skb);
            return -EMSGSIZE;


        }
    
        ip6_forward_finish()


            dst_output(skb); = "dst_entry->output()" = ip6_output(skb)


                ip6_finish_output(skb)


                    #
                    # frag logic happen here.
                    #


                    if ((skb->len > ip6_skb_dst_mtu(skb) && !skb_is_gso(skb)) ||
                        dst_allfrag(skb_dst(skb)) ||
                        (IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size))
                        return ip6_fragment(skb, ip6_finish_output2);
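

--- The difference between the two forward paths can be boiled down to a small decision function. This is a deliberately reduced sketch: the enum and function names are made up, and the GSO, "local_df" and "frag_max_size" conditions from the excerpts above are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* What a forwarding node does with a packet headed for a link with this MTU. */
enum fwd_action {
    FWD_SEND,        /* fits: transmit unchanged */
    FWD_FRAGMENT,    /* too big: fragment on the output path */
    FWD_ICMP_ERROR   /* too big: drop and report to the source */
};

/* ipv4: a router may fragment unless the DF bit forbids it. */
static enum fwd_action ip4_forward_decision(size_t pkt_len, size_t mtu, bool df_set)
{
    if (pkt_len <= mtu)
        return FWD_SEND;
    return df_set ? FWD_ICMP_ERROR : FWD_FRAGMENT;
}

/* ipv6: a router never fragments; an oversized packet is always reported. */
static enum fwd_action ip6_forward_decision(size_t pkt_len, size_t mtu)
{
    return pkt_len <= mtu ? FWD_SEND : FWD_ICMP_ERROR;
}
```

For ipv4 the error is ICMP_DEST_UNREACH / ICMP_FRAG_NEEDED; for ipv6 it is ICMPV6_PKT_TOOBIG. In both cases the message carries the MTU, so the source can adjust its PMTU.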




---------------------------------------------------------------------------------------------------------------


===============================================================================================================


@@-@-@ retransmission   # always a "original packet", so for all "fragment packet" of a "original packet", not for a single "fragment packet"


Packet loss is common in networks.


If some "fragment packet" gets lost on the network path, the "original packet" cannot be defragmented at the destination node.


Retransmission is performed by the L4 layer of the source node.


    The L4 layer (like TCP) considers the whole "original packet" lost, and retransmits it.    # L4 doesn't know which "fragment packet" was lost.


    The retransmitted "original packet" goes through the fragmentation logic AGAIN.




===============================================================================================================


@@-@ defrag


    ipv4:   int ip_defrag(struct sk_buff *skb, u32 user)


    ipv6:   "Fragment Header" extension header handler - "frag_protocol" "->handler()" = ipv6_frag_rcv()




===============================================================================================================


@@-@-@ defrag logic # from local delivery path.


#
# local delivery path:
#
#   packet destination address is a unicast L3 address configured on the local node, or a multicast L3 address the
# local node is interested in.
#
#   Then the ingress routing lookup determines that this ingress packet should be passed up to the L4 layer.
#
#
#   If this ingress packet is a "fragment packet", then it goes through defrag logic first.
#
#   There are 2 possible results from defrag logic:
#
#       #1. Not all "fragment packets" of the "original packet" have been received yet, so this "fragment packet"
#           is enqueued into an internal queue, waiting for the other "fragment packets" to come in.
#
#           In this case, this "fragment packet" is considered taken by defrag logic,
#           and its handling stops for now.
#
#
#       #2. This ingress "fragment packet" makes the pending internal queue complete.
#
#               --- Note that "fragment packets" might arrive out of order, so it might not be the
#                   last "fragment packet", but could be any intermediate or the 1st "fragment packet".
#
#           In this case, the reassembly operation is performed.
#
#           [*] And the reassembly operation plays a magic:
#
#                   make the ingress "fragment packet" skb become the "original packet".
#
#           Then, continue to pass the reassembled "original packet" up to the L4 RX handler.
#
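

--- The two results above ( "queued, still waiting" vs. "queue complete" ) can be modeled with a small user-space queue. The names below are hypothetical and the model is simplified: it assumes fragments never overlap or repeat, and it ignores timers. The byte counter plays a role similar to the received-bytes accounting a real reassembly queue has to keep.

```c
#include <stdbool.h>
#include <stddef.h>

/* One received fragment, reduced to its position within the original packet. */
struct frag_node {
    size_t offset;           /* byte offset within the fragmentable part */
    size_t len;              /* fragment length in bytes */
    struct frag_node *next;
};

/* Hypothetical per-"original packet" reassembly state. */
struct reasm_queue {
    struct frag_node *head;  /* pending fragments, kept sorted by offset */
    size_t meat;             /* bytes received so far */
    size_t total_len;        /* known once the last fragment has arrived */
    bool have_last;
};

/* Queue one fragment; return true when the original packet is complete. */
static bool reasm_add(struct reasm_queue *q, struct frag_node *f,
                      bool more_fragments)
{
    struct frag_node **pp = &q->head;

    /* Find the insertion point so the list stays ordered by offset. */
    while (*pp != NULL && (*pp)->offset < f->offset)
        pp = &(*pp)->next;
    f->next = *pp;
    *pp = f;
    q->meat += f->len;

    if (!more_fragments) {
        q->have_last = true;
        q->total_len = f->offset + f->len;
    }
    return q->have_last && q->head->offset == 0 && q->meat == q->total_len;
}
```

Completion depends only on what has been collected, not on which fragment arrived last, which is why any fragment can be the one that triggers reassembly.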


---------------------------------------------------------------------------------------------------------------


--- ipv4 case:


    netif_receive_skb()     # L2->L3 RX interface
    
        "ip_packet_type->func" = ip_rcv() -> ip_rcv_finish()


            #
            # ingress routing lookup. 
            #
            # then "skb_dst(skb)->input" would be set to:
            #
            #       #1. ip_local_deliver()      # if packet dst_addr is UC address configured on local node.
            #
            #       #2. ip_mr_input()           # if packet dst_addr is MC address, then it would be:
            #                                   #   #a. if multicast routing, then forwarded by ip_mr_forward() with a skb clone.
            #                                   #   #b. then, if interested by local node, call to ip_local_deliver().
            #


            ip_route_input_noref(skb, iph->daddr, iph->saddr,
                           iph->tos, skb->dev);




            dst_input(skb) = ip_local_deliver()
    
                #
                # defrag logic.
                #


                if (ip_is_fragment(ip_hdr(skb))) {
                    if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
                        return 0;
                }
            
                return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                           ip_local_deliver_finish);
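

--- What "is a fragment" means for ipv4 can be shown on the 16-bit "frag_off" header field: a packet is a fragment if the MF flag is set ( more fragments follow ) or the 13-bit offset is non-zero ( it is not the first piece ). The sketch below is a user-space illustration; the EX_ constants and "is_fragment()" are names chosen here, with bit values following the IPv4 header layout.

```c
#include <arpa/inet.h>  /* htons */
#include <stdbool.h>
#include <stdint.h>

#define EX_IP_DF     0x4000  /* Don't Fragment flag */
#define EX_IP_MF     0x2000  /* More Fragments flag */
#define EX_IP_OFFSET 0x1FFF  /* 13-bit fragment offset, in 8-octet units */

/* frag_off_be is the header field as found on the wire (network byte order). */
static bool is_fragment(uint16_t frag_off_be)
{
    return (frag_off_be & htons(EX_IP_MF | EX_IP_OFFSET)) != 0;
}
```

A packet with only DF set is not a fragment; the first fragment has MF set and offset 0; the last fragment has MF clear and a non-zero offset.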
            


---------------------------------------------------------------------------------------------------------------


--- ipv6 case:


    netif_receive_skb()     # L2->L3 RX interface
    
        "ipv6_packet_type->func()" = ipv6_rcv() -> ip6_rcv_finish()


            #
            # ingress routing lookup. 
            #
            # then "skb_dst(skb)->input" would be set to:
            #
            #       #1. ip6_input()         # if packet dst_addr is UC address configured on local node.
            #
            #       #2. ip6_mc_input()      # if packet dst_addr is MC address, then it would be:
            #                               #   #a. if multicast routing, then forwarded by ip6_mr_forward() with a skb clone.
            #                               #   #b. then, if interested by local node, call to ip6_input().


            if (!skb_dst(skb))
                ip6_route_input(skb);




            /*
             *  Deliver IP Packets to the higher protocol layers.
             */


            dst_input(skb) = ip6_input() -> ip6_input_finish()




                resubmit:
                    nhoff = IP6CB(skb)->nhoff;
                    nexthdr = skb_network_header(skb)[nhoff];
                
                    if ((ipprot = rcu_dereference(inet6_protos[nexthdr])) != NULL) {
                
                        ret = ipprot->handler(skb);
                        if (ret > 0)
                            goto resubmit;
                    
                        ------------------------------------------------------


                        #
                        # defrag logic:
                        #
                        #       "Fragment Header" extension header handler - "frag_protocol" "->handler()" = ipv6_frag_rcv()
                        #
                
                        ------------------------------------------------------
                    }
                    else {
                        #
                        # release the ingress packet.
                        #
                        kfree_skb(skb); / consume_skb(skb);
                    }




===============================================================================================================


@@-@-@ reassembly magic  # new ingress "fragment packet", _become_ "original packet"


function:


    ipv4:   ip_frag_reasm()


    ipv6:   ip6_frag_reasm()






All pending "fragment packets" of an "original packet" are organized into an internal queue structure.


    struct inet_frag_queue {


        #
        # Linked by "sk_buff->next" field.
        #


        struct sk_buff      *fragments; /* list of received fragments */
    }




A figure of the pending "fragment packets":


    ----------------------      -----------------------      -----------------------               ----------------------- 
    | "fragment packet"  |      | "fragment packet"   |      |  "fragment packet"  |               | "fragment packet"   | 
    |       #0           | ---> |       #1            | ---> |      #3             | ---> ... ---> |    #N               |
    |                    |      |                     |      |                     |               |                     |
    ----------------------      -----------------------      -----------------------               ----------------------- 




If a new ingress "fragment packet" makes the pending "original packet" queue complete, then reassembly plays its magic to
get the "original packet".


------------------------------------------------------


#1. Intentionally, do _NOT_ link the new ingress "fragment packet" into the "inet_frag_queue->fragments" list.


                                            ----------------------
                                            | "fragment packet"  |      # call it "fragment_packet_new_in"
                                            |       #2           |
                                            |                    |
                                            ----------------------




    ----------------------      -----------------------      -----------------------               ----------------------- 
    | "fragment packet"  |      | "fragment packet"   |      |  "fragment packet"  |               | "fragment packet"   | 
    |       #0           | ---> |       #1            | ---> |      #3             | ---> ... ---> |    #N               |
    |                    |      |                     |      |                     |               |                     |
    ----------------------      -----------------------      -----------------------               ----------------------- 




------------------------------------------------------


#2. Create an skb clone of "fragment_packet_new_in"


                ----------------------                              ----------------------  
                | "fragment packet"  |      skb_clone()             | clone of           |  
                |       #2           |  create an skb clone         |   "fragment packet"|    # call it "fragment_packet_new_in_CLONE"
                |                    |                              |       #2           |          
                ----------------------                              ----------------------  
                        |                                                   |
                        |                                                   |
                        ------ share ----------------------------- share ----
                                                |
                                                |   
                                                \/
                                    -----------------------------
                                    |   data area               |
                                    |                           |
                                    |   content is              |
                                    |       L4 payload portion  |
                                    -----------------------------




------------------------------------------------------


#3. Link "fragment_packet_new_in_CLONE" ( not "fragment_packet_new_in" ) into "inet_frag_queue->fragments"




                                                            #
                                                            # In fact, this is
                                                            #   "fragment_packet_new_in_CLONE"  
                                                            #   


    ----------------------      -----------------------      -----------------------      -----------------------              ----------------------- 
    | "fragment packet"  |      | "fragment packet"   |      |  "fragment packet"  |      | "fragment packet"  |               | "fragment packet"   | 
    |       #0           | ---> |       #1            | ---> |      #2             | ---> |         #3         | ---> ... ---> |    #N               |
    |                    |      |                     |      |                     |      |                    |               |                     |
    ----------------------      -----------------------      -----------------------      -----------------------               ----------------------- 
                                                                                        


------------------------------------------------------


#4. Make "fragment_packet_new_in" _discard_ its data area, so that "fragment_packet_new_in_CLONE" now has
    exclusive ownership of the data area.


                ----------------------------                            ----------------------------------  
                | "fragment_packet_new_in" |                            | "fragment_packet_new_in_CLONE" |
                |                          |                            |                                |
                |                          |                            |                                |  
                ----------------------------                            ----------------------------------
                        |                                                   |
                        |                                                   |
                        ---- _discard_          ---- exclusive ownership ----
                                                |
                                                |   
                                                \/
                                    -----------------------------
                                    |   data area               |
                                    |                           |
                                    |   content is              |
                                    |       L4 payload portion  |
                                    -----------------------------




    In the meantime, "fragment_packet_new_in" _hijacks_ the data area of "fragment packet #0",
    and the remaining "fragment packet"s in "inet_frag_queue->fragments" are linked into its "skb_shared_info->frag_list".


            skb_morph(head, fq->q.fragments);


            consume_skb(fq->q.fragments);




    [*] This builds a layout similar to the one used by the fragmentation "fast path".


            "fragment_packet_new_in"
                                                 
                        |
                        | hijack data area
                        | 


            ----------------------                                       -----------------------     -----------------------      -----------------------              ----------------------- 
            | "fragment packet"  |                                       | "fragment packet"  |      |  "fragment packet"  |      | "fragment packet"  |               | "fragment packet"   | 
            |       #0           | --- "skb_shared_info->frag_list" ---> |      #1            | ---> |      #2             | ---> |         #3         | ---> ... ---> |    #N               |
            |                    |                                       |                    |      |                     |      |                    |               |                     |
            ----------------------                                       -----------------------     -----------------------      -----------------------               ----------------------- 
    


    That is, the new ingress "skb", which was a "fragment packet", has now become the "original packet".        # so this is the magic.


    [*] But the "skb" pointer value is _NOT_ changed.
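

    Steps #2 to #4 can be modelled in user space as below. This is only a sketch under stated assumptions:
    every "toy_*" name is hypothetical, "toy_clone()" / "toy_morph()" merely imitate what skb_clone() /
    skb_morph() are described as doing above, "frag_list" lives directly in the toy skb (the kernel keeps
    it in "skb_shared_info"), and only the case where the new fragment is not #0 is covered.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins, NOT the kernel structures. */
struct toy_data {
    int refcnt;                /* how many toy_skb share this data area */
    const char *payload;
};

struct toy_skb {
    struct toy_skb *next;      /* plays the role of sk_buff->next */
    struct toy_skb *frag_list; /* the kernel keeps this in skb_shared_info */
    struct toy_data *data;
};

static struct toy_skb *toy_alloc(const char *payload)
{
    struct toy_skb *skb = calloc(1, sizeof(*skb));
    struct toy_data *d = calloc(1, sizeof(*d));

    if (skb == NULL || d == NULL) {
        free(skb);
        free(d);
        return NULL;
    }
    d->refcnt = 1;
    d->payload = payload;
    skb->data = d;
    return skb;
}

static void toy_put_data(struct toy_data *d)
{
    if (--d->refcnt == 0)
        free(d);
}

/* Like step #2: a new header that shares the same data area. */
static struct toy_skb *toy_clone(const struct toy_skb *src)
{
    struct toy_skb *c = calloc(1, sizeof(*c));

    if (c != NULL) {
        c->data = src->data;
        c->data->refcnt++;
    }
    return c;
}

/* Imitates skb_morph(): dst drops its own data area and shares src's. */
static void toy_morph(struct toy_skb *dst, const struct toy_skb *src)
{
    toy_put_data(dst->data);
    dst->data = src->data;
    dst->data->refcnt++;
}

static void toy_free(struct toy_skb *skb)
{
    toy_put_data(skb->data);
    free(skb);
}

/*
 * prev is the queued fragment that precedes the new one; it must not be
 * NULL (the new fragment is assumed not to be fragment #0).
 * Returns skb itself, now acting as the "original packet".
 */
static struct toy_skb *toy_reasm(struct toy_skb **fragments,
                                 struct toy_skb *prev, struct toy_skb *skb)
{
    struct toy_skb *first = *fragments;
    struct toy_skb *clone = toy_clone(skb);

    if (clone == NULL)
        return NULL;

    /* Step #3: the clone, not skb, takes the slot after prev. */
    clone->next = prev->next;
    prev->next = clone;

    /* Step #4: skb takes over fragment #0's data; the old #0 header goes away. */
    toy_morph(skb, first);
    skb->frag_list = first->next;
    skb->next = NULL;
    toy_free(first);

    *fragments = NULL;
    return skb;
}
```

    The caller keeps using the very same "skb" pointer before and after the call; only what it points at has changed.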




------------------------------------------------------


#5. Since "skb" has now become the "original packet", pass it up to the L4 layer.


    ipv4:   ip_local_deliver(skb)


                ip_frag_reasm(skb)      # "skb" becomes the "original packet" internally


                #
                # Deliver the "original packet" to the L4 layer.
                #


                ip_local_deliver_finish(skb)


    
    ipv6:   ip6_input_finish(skb)


                resubmit:
                    
                    "inet6_protocol->handler(skb)"


                    #   
                    # "Fragment Header" extension header handler - "frag_protocol->handler()" = ipv6_frag_rcv(skb)      # "skb" becomes the "original packet" internally
                    #
                    #       goto the next iteration of the "resubmit" label.
                    #
                    #           then pass the packet up to the final L4 protocol handler "inet6_protocol->handler()".
                    #
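

    The ipv6 "resubmit" loop above can be sketched as a tiny dispatch loop. This is a toy model, not the
    kernel code: all "toy_*" names are hypothetical, and it assumes that a positive handler return value
    means "look up the handler for the next header and try again". The protocol numbers 44 (Fragment
    Header) and 17 (UDP) are the real assigned values.

```c
#include <stddef.h>

enum {
    TOY_NEXTHDR_UDP = 17,      /* UDP protocol number */
    TOY_NEXTHDR_FRAGMENT = 44  /* IPv6 Fragment Header */
};

struct toy_pkt {
    int nexthdr;      /* header type to be processed next */
    int reassembled;  /* set once the toy defrag handler has run */
    int delivered;    /* set once the toy L4 handler has run */
};

typedef int (*toy_handler)(struct toy_pkt *pkt);

/* Stands in for ipv6_frag_rcv(): pkt "becomes" the original packet. */
static int toy_frag_rcv(struct toy_pkt *pkt)
{
    pkt->reassembled = 1;
    pkt->nexthdr = TOY_NEXTHDR_UDP; /* next header of the reassembled packet */
    return 1;                       /* positive: ask the caller to resubmit */
}

/* Stands in for the final L4 protocol handler. */
static int toy_udp_rcv(struct toy_pkt *pkt)
{
    pkt->delivered = 1;
    return 0;
}

static toy_handler toy_lookup(int nexthdr)
{
    switch (nexthdr) {
    case TOY_NEXTHDR_FRAGMENT:
        return toy_frag_rcv;
    case TOY_NEXTHDR_UDP:
        return toy_udp_rcv;
    default:
        return NULL;
    }
}

/* Returns 0 when an L4 handler consumed the packet, -1 if no handler exists. */
static int toy_input_finish(struct toy_pkt *pkt)
{
    toy_handler handler;

resubmit:
    handler = toy_lookup(pkt->nexthdr);
    if (handler == NULL)
        return -1;
    if (handler(pkt) > 0)
        goto resubmit; /* same pkt pointer, now seen as the original packet */
    return 0;
}
```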




 [*] It is OK to pass a reassembled "original packet" with child skb fragments up to the L4 layer, because all later
     operations on this "original packet" happen on the local node.


    Commonly, the data content of the reassembled "original packet" skb is copied to the user-level "iovec" buffer
    provided by the socket recvmsg() system call, by:


            skb_copy_datagram_iovec()


    which carefully handles the skb data content in all three places: its "main data space", its "children page
    fragments", and its "children skb fragments".
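

    A user-space sketch of why such a copy has to walk more than one area. The "toy_*" names are
    hypothetical; the sketch models only the "main data space" and the "children skb fragments"
    (the "frag_list" chain) and leaves out page fragments entirely.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an skb, NOT the kernel structure. */
struct toy_buf {
    const char *data;          /* "main data space" */
    size_t len;                /* bytes in the main data space */
    struct toy_buf *next;      /* sibling link inside a frag_list chain */
    struct toy_buf *frag_list; /* "children skb fragments" (head only) */
};

/*
 * Copy up to cap bytes of the logical packet, starting at offset,
 * into dst. Returns the number of bytes actually copied.
 */
static size_t toy_copy(const struct toy_buf *head, size_t offset,
                       char *dst, size_t cap)
{
    const struct toy_buf *b;
    size_t copied = 0;
    size_t n;

    /* First the head's own data area. */
    if (offset < head->len) {
        n = head->len - offset;
        if (n > cap)
            n = cap;
        memcpy(dst, head->data + offset, n);
        copied = n;
        offset = 0;
    } else {
        offset -= head->len;
    }

    /* Then every child on the frag_list chain, in order. */
    for (b = head->frag_list; b != NULL && copied < cap; b = b->next) {
        if (offset >= b->len) {
            offset -= b->len; /* requested range starts after this child */
            continue;
        }
        n = b->len - offset;
        if (n > cap - copied)
            n = cap - copied;
        memcpy(dst + copied, b->data + offset, n);
        copied += n;
        offset = 0;
    }
    return copied;
}
```

    A receiver that reads at an arbitrary offset therefore never needs the fragments to be merged into one
    contiguous buffer first.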




===============================================================================================================


@@ netfilter: fragment handling # <symlink> to: <<L3---fragment---netfilter.txt>>




===============================================================================================================


@@ ipsec: fragment handling # <symlink> to: <<ipsec---fragment.txt>>
    


===============================================================================================================


@@ end