Extracting substrings from mixed Chinese/English UTF-8 strings in Lua

Reference blog post: "UTF8字符串在lua的截取和字数统计" (Truncating and counting characters of UTF-8 strings in Lua) [repost]

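Requirements

Extract a substring by counting visible characters, not bytes.

utf8sub(string, start position, number of characters)

utf8sub("你好1世界哈哈", 2, 5)    =    好1世界哈
utf8sub("1你好1世界哈哈", 2, 5)    =    你好1世界
utf8sub("你好世界1哈哈", 1, 5)    =    你好世界1
utf8sub("", 1, 5)    =    (empty string)
utf8sub("øpø你好pix", 2, 5)    =    pø你好p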

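Wrong approaches

Most of the algorithms I found online are not quite right. They either produce garbled output or only handle Chinese characters of a fixed 3-byte width, which is not general enough.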

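1. string.sub(s, 1, length*3)

Many posts simply call `string.sub(s, 1, length*3)`. That cannot be right for strings that mix Chinese and English. In `你好1世界`, for example, the characters occupy `3,3,1,3,3` bytes. Asking for 4 characters takes 4*3 = 12 bytes = 3+3+1+3+2, so only the first 2 bytes of `界` are kept and the result ends in garbage.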
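To make the failure concrete, here is a minimal sketch of the naive cut. The helper name `naive_sub` is made up for this illustration, and the sketch assumes the source file is saved as UTF-8, where each of these Chinese characters takes 3 bytes.

```lua
-- Hypothetical helper: the naive "n characters = n * 3 bytes" cut.
function naive_sub(s, n)
    return string.sub(s, 1, n * 3)
end

local s = "你好1世界"
print(#s)                      -- 13 bytes: 3 + 3 + 1 + 3 + 3
local cut = naive_sub(s, 4)
print(#cut)                    -- 12 bytes
-- The first 10 bytes are four intact characters ...
print(cut:sub(1, 10) == "你好1世")
-- ... followed by 2 stray bytes: the first two bytes of 界.
print(cut:sub(11, 12) == ("界"):sub(1, 2))
```

The cut only lands on a character boundary when every character before it happens to be 3 bytes wide.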

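2. if byte>128 then index = index + 3

This skips a fixed 3 bytes whenever it sees a non-ASCII byte, so it breaks on characters of any other width, such as the 2-byte `ø`.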
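The following sketch shows how a fixed step miscounts. `fixed_step_len` is a hypothetical name; it applies the rule above with a step of 3 bytes (the width of common Chinese characters in UTF-8) and counts how many "characters" it sees.

```lua
-- Hypothetical helper: count characters with a fixed 3-byte step
-- for every non-ASCII byte. Shown only to illustrate the bug.
function fixed_step_len(s)
    local index, count = 1, 0
    while index <= #s do
        if string.byte(s, index) > 128 then
            index = index + 3   -- assumes every non-ASCII character is 3 bytes
        else
            index = index + 1
        end
        count = count + 1
    end
    return count
end

print(fixed_step_len("你好1"))  -- 3, correct: every non-ASCII character is 3 bytes
print(fixed_step_len("øpø"))    -- 2, wrong: ø is 2 bytes, so the step swallows the p
```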

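Key points

1. UTF-8 is a variable-length encoding

2. The length of each character follows a pattern

As described in the article on character encodings, UTF-8 is an encoding scheme for the Unicode character set. Its variable-length layout is: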

One byte:    0*******
Two bytes:   110*****, 10******
Three bytes: 1110****, 10******, 10******
Four bytes:  11110***, 10******, 10******, 10******
Five bytes:  111110**, 10******, 10******, 10******, 10******
Six bytes:   1111110*, 10******, 10******, 10******, 10******, 10******
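The pattern is easy to check by printing the first byte of a few characters in binary. This is only an illustrative sketch: `lead_byte_bits` is a hypothetical helper, and it assumes the source file is saved as UTF-8.

```lua
-- Hypothetical helper: return the first byte of `ch` as an 8-digit binary string.
function lead_byte_bits(ch)
    local b = string.byte(ch, 1)
    local bits = {}
    for i = 8, 1, -1 do
        bits[i] = b % 2           -- lowest remaining bit
        b = math.floor(b / 2)     -- shift right by one
    end
    return table.concat(bits)
end

print(lead_byte_bits("1"))   -- 00110001  (0*******: one byte)
print(lead_byte_bits("ø"))   -- 11000011  (110*****: two bytes)
print(lead_byte_bits("你"))  -- 11100100  (1110****: three bytes)
```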

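So, given a byte string, the byte length of a UTF-8 character can be determined from its first byte alone: the value of that byte tells you how many bytes the character uses. (The five- and six-byte forms come from the original UTF-8 design. RFC 3629 limits valid UTF-8 to at most four bytes, so they should not appear in well-formed input.)

The code is as follows: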

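-- Return the byte length of a UTF-8 character, given the value of its first byte.
-- A continuation byte (128..191) is reported as length 1 so callers always advance.
local function charsize(ch)
    if not ch then return 0
    elseif ch >= 252 then return 6
    elseif ch >= 248 and ch < 252 then return 5
    elseif ch >= 240 and ch < 248 then return 4
    elseif ch >= 224 and ch < 240 then return 3
    elseif ch >= 192 and ch < 224 then return 2
    elseif ch < 192 then return 1
    end
end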

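-- Count the characters in a UTF-8 string; every character counts as one.
-- Example: utf8len("1你好") => 3
-- Also returns the number of single-byte and multi-byte characters.
function utf8len(str)
    local len = 0
    local aNum = 0 -- single-byte characters (letters, digits, ...)
    local hNum = 0 -- multi-byte characters (Chinese and others)
    local currentIndex = 1
    while currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        local cs = charsize(char)
        currentIndex = currentIndex + cs
        len = len + 1
        if cs == 1 then
            aNum = aNum + 1
        elseif cs >= 2 then
            hNum = hNum + 1
        end
    end
    return len, aNum, hNum
end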
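As a cross-check, the character count can also be obtained without a lead-byte table by counting every byte that is not a continuation byte (`10******`, i.e. 128 to 191). This is a sketch of a common Lua idiom, not part of the referenced post; the name `utf8len_gsub` is hypothetical, and the result is only meaningful for well-formed UTF-8.

```lua
-- Hypothetical alternative: count the bytes that start a character.
-- string.gsub returns the number of matches as its second result.
function utf8len_gsub(str)
    local _, count = string.gsub(str, "[^\128-\191]", "")
    return count
end

print(utf8len_gsub("你好1世界哈哈"))  -- 7
print(utf8len_gsub("øpø你好pix"))     -- 8
print(utf8len_gsub(""))               -- 0
```

Unlike `utf8len`, this variant does not distinguish single-byte from multi-byte characters.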

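-- Extract a substring from a UTF-8 string.
-- str:          the string to cut
-- startChar:    index of the first character, starting at 1
-- numChars:     number of characters to take
function utf8sub(str, startChar, numChars)
    -- Skip the first (startChar - 1) characters to find the starting byte.
    local startIndex = 1
    while startChar > 1 and startIndex <= #str do
        local char = string.byte(str, startIndex)
        startIndex = startIndex + charsize(char)
        startChar = startChar - 1
    end
    -- Advance over numChars characters, or stop at the end of the string.
    local currentIndex = startIndex
    while numChars > 0 and currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + charsize(char)
        numChars = numChars - 1
    end
    return str:sub(startIndex, currentIndex - 1)
end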
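The same result can be reached by letting a Lua pattern split the string into characters: one lead byte followed by any number of continuation bytes. The sketch below is an alternative for comparison, not the method of the referenced post. `utf8sub_gmatch` is a hypothetical name, the pattern is modeled on the one Lua 5.3+ exposes as `utf8.charpattern`, and it assumes well-formed input without NUL bytes.

```lua
-- Hypothetical alternative: walk the string one UTF-8 character at a time.
-- Lead byte: 1..127 (ASCII) or 194..244; continuation bytes: 128..191.
function utf8sub_gmatch(str, startChar, numChars)
    local out, index = {}, 0
    for ch in string.gmatch(str, "[\1-\127\194-\244][\128-\191]*") do
        index = index + 1
        if index >= startChar + numChars then break end
        if index >= startChar then
            out[#out + 1] = ch
        end
    end
    return table.concat(out)
end

print(utf8sub_gmatch("你好1世界哈哈", 2, 5))  -- 好1世界哈
print(utf8sub_gmatch("øpø你好pix", 2, 5))     -- pø你好p
```

This version builds a temporary table, so the index-based `utf8sub` is the lighter choice for long strings.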

-- Self-test
function test()
    -- test utf8len
    assert(utf8len("你好1世界哈哈") == 7)
    assert(utf8len("你好世界1哈哈 ") == 8)
    assert(utf8len(" 你好世 界1哈哈") == 9)
    assert(utf8len("") == 0)
    assert(utf8len("øpø你好pix") == 8)
    -- test utf8sub
    assert(utf8sub("你好1世界哈哈", 2, 5) == "好1世界哈")
    assert(utf8sub("1你好1世界哈哈", 2, 5) == "你好1世界")
    assert(utf8sub(" 你好1世界 哈哈", 2, 6) == "你好1世界 ")
    assert(utf8sub("你好世界1哈哈", 1, 5) == "你好世界1")
    assert(utf8sub("", 1, 5) == "")
    assert(utf8sub("øpø你好pix", 2, 5) == "pø你好p")
    print("all tests passed")
end

test()

秒客网
