Parsing a string into an array based on spaces or "double-quoted strings"

I'm trying to take a user input string and parse it into an array called char *entire_line[100]; where each word is put at a different index of the array, but if part of the string is enclosed in double quotes, that part should be put in a single index. So if I have

char buffer[1024]={0,};
fgets(buffer, 1024, stdin);

example input: word filename.txt "this is a string that should take up one index in an output array"

tokenizer=strtok(buffer," ");//break up by spaces
        do{
            if(strchr(tokenizer,'"')){//check is a word starts with a "
            is_string=YES;
            entire_line[i]=tokenizer;// if so, put that word into current index
            tokenizer=strtok(NULL,"\""); //should get rest of string until end "
            strcat(entire_line[i],tokenizer); //append the two together, ill take care of the missing space once i figure out this issue

              }  
        entire_line[i]=tokenizer;
        i++;
        }while((tokenizer=strtok(NULL," \n"))!=NULL);

This clearly isn't working. It only gets close if the double-quoted string is at the end of the input, but I could also have input like: word "this is text that will be user entered" filename.txt. I've been trying to figure this out for a while and always get stuck somewhere. Thanks.

4 Solutions

#1

The strtok function is a terrible way to tokenize in C, except for one (admittedly common) case: simple whitespace-separated words. (Even then it's still not great due to lack of re-entrance and recursion ability, which is why we invented strsep for BSD way back when.)
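
To illustrate the re-entrancy point, here is a minimal sketch. It assumes a POSIX system where strtok_r is available (it is not part of ISO C), and the helper name count_fields and its ';'-separated record format are made up for illustration. Because each strtok_r call keeps its position in a caller-supplied pointer, an inner tokenizing loop can run inside an outer one, which plain strtok cannot do safely.

```c
#define _POSIX_C_SOURCE 200809L /* for strtok_r; assumes a POSIX system */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits `text` into ';'-separated records and counts
 * the space-separated fields across all records. Modifies `text` in place. */
size_t count_fields(char *text)
{
    char *outer_state, *inner_state;
    char *record, *field;
    size_t fields = 0;

    assert(text != NULL);
    for (record = strtok_r(text, ";", &outer_state);
         record != NULL;
         record = strtok_r(NULL, ";", &outer_state)) {
        /* The inner loop has its own state, so it does not disturb the outer one. */
        for (field = strtok_r(record, " ", &inner_state);
             field != NULL;
             field = strtok_r(NULL, " ", &inner_state)) {
            fields++;
        }
    }
    return fields;
}
```

With plain strtok, the inner call would overwrite the single hidden position and the outer loop would likely stop after the first record.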

Your best bet in this case is to build your own simple state-machine:

char *p, *start_of_word;
int c;
enum states { DULL, IN_WORD, IN_STRING } state = DULL;

for (p = buffer; *p != '\0'; p++) {
    c = (unsigned char) *p; /* convert to unsigned char for is* functions */
    switch (state) {
    case DULL: /* not in a word, not in a double quoted string */
        if (isspace(c)) {
            /* still not in a word, so ignore this char */
            continue;
        }
        /* not a space -- if it's a double quote we go to IN_STRING, else to IN_WORD */
        if (c == '"') {
            state = IN_STRING;
            start_of_word = p + 1; /* word starts at *next* char, not this one */
            continue;
        }
        state = IN_WORD;
        start_of_word = p; /* word starts here */
        continue;

    case IN_STRING:
        /* we're in a double quoted string, so keep going until we hit a close " */
        if (c == '"') {
            /* word goes from start_of_word to p-1 */
            /* ... do something with the word ... */
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_STRING or we handled the end above */

    case IN_WORD:
        /* we're in a word, so keep going until we get to a space */
        if (isspace(c)) {
            /* word goes from start_of_word to p-1 */
            /* ... do something with the word ... */
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_WORD or we handled the end above */
    }
}

Note that this does not account for the possibility of a double quote inside a word, e.g.:

"some text in quotes" plus four simple words p"lus something strange"

Work through the state machine above and you will see that "some text in quotes" turns into a single token (that ignores the double quotes), but p"lus is also a single token (that includes the quote), something is a single token, and strange" is a token. Whether you want this, or how you want to handle it, is up to you. For more complex but thorough lexical tokenization, you may want to use a code-building tool like flex.
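
If you would rather have a quote in the middle of a word join the pieces together, roughly the way a shell does, one possible sketch follows. This is an assumption about what you might want, not part of the state machine above; the name split_joined is made up for illustration. A double quote toggles a "quoted" mode wherever it appears, the quote characters are dropped, and whitespace only ends a token outside quotes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variant: a double quote toggles "quoted" mode anywhere,
 * even mid-word, and the quote characters themselves are dropped.
 * Tokens are compacted in place inside `buffer`; returns the token count. */
size_t split_joined(char *buffer, char *argv[], size_t argv_size)
{
    char *in = buffer;   /* read position */
    char *out = buffer;  /* write position, never ahead of `in` while reading */
    size_t argc = 0;

    assert(buffer != NULL && argv != NULL);
    while (argc < argv_size) {
        int in_quotes = 0;
        char *start;

        while (isspace((unsigned char)*in))
            in++;                 /* skip whitespace between tokens */
        if (*in == '\0')
            break;

        start = out;
        while (*in != '\0' && (in_quotes || !isspace((unsigned char)*in))) {
            if (*in == '"')
                in_quotes = !in_quotes;   /* drop the quote, flip the mode */
            else
                *out++ = *in;
            in++;
        }
        if (*in != '\0')
            in++;                 /* consume the separator before writing NUL */
        *out++ = '\0';
        argv[argc++] = start;
    }
    return argc;
}
```

With this variant, p"lus something strange" becomes the single token plus something strange, and an unterminated quote simply runs to the end of the input.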

Also, when the for loop exits, if state is not DULL, you need to handle the final word (I left this out of the code above) and decide what to do if state is IN_STRING (meaning there was no close-double-quote).
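
One way the end-of-input handling could look is sketched below. It wraps the same three states in a function; the name tokenize, its signature, and the choice to treat an unterminated quote as an error (returning -1) are assumptions made for illustration, not the only reasonable policy.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum state { DULL, IN_WORD, IN_STRING };

/* Hypothetical wrapper around the state machine: stores up to `max` tokens
 * in `argv` (terminating them in place) and returns the count, or -1 if the
 * input ends inside an unterminated double-quoted string. */
long tokenize(char *buffer, char *argv[], size_t max)
{
    enum state state = DULL;
    char *start_of_word = NULL;
    size_t argc = 0;
    char *p;

    assert(buffer != NULL && argv != NULL);
    for (p = buffer; *p != '\0' && argc < max; p++) {
        int c = (unsigned char)*p;
        switch (state) {
        case DULL:
            if (isspace(c))
                break;
            if (c == '"') { state = IN_STRING; start_of_word = p + 1; }
            else          { state = IN_WORD;   start_of_word = p; }
            break;
        case IN_STRING:
        case IN_WORD:
            /* a string ends at a quote, a word ends at whitespace */
            if ((state == IN_STRING) ? (c == '"') : isspace(c)) {
                *p = '\0';
                argv[argc++] = start_of_word;
                state = DULL;
            }
            break;
        }
    }
    /* End of input: decide what the leftover state means. */
    if (state == IN_STRING)
        return -1;                     /* assumption: treat as an error */
    if (state == IN_WORD && argc < max)
        argv[argc++] = start_of_word;  /* last word ran to the end of the buffer */
    return (long)argc;
}
```

Since fgets keeps the trailing newline and isspace matches it, the last word of a typed line is normally closed inside the loop; the post-loop branch matters when the input has no trailing whitespace.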

#2

Torek's parsing code is excellent but requires a little more work to use.

For my own purposes, I finished it as a C function.
Here I share my work, which is based on Torek's code.

#include <stdio.h>
#include <string.h>
#include <ctype.h>
size_t split(char *buffer, char *argv[], size_t argv_size)
{
    char *p, *start_of_word;
    int c;
    enum states { DULL, IN_WORD, IN_STRING } state = DULL;
    size_t argc = 0;

    for (p = buffer; argc < argv_size && *p != '\0'; p++) {
        c = (unsigned char) *p;
        switch (state) {
        case DULL:
            if (isspace(c)) {
                continue;
            }

            if (c == '"') {
                state = IN_STRING;
                start_of_word = p + 1; 
                continue;
            }
            state = IN_WORD;
            start_of_word = p;
            continue;

        case IN_STRING:
            if (c == '"') {
                *p = 0;
                argv[argc++] = start_of_word;
                state = DULL;
            }
            continue;

        case IN_WORD:
            if (isspace(c)) {
                *p = 0;
                argv[argc++] = start_of_word;
                state = DULL;
            }
            continue;
        }
    }

    if (state != DULL && argc < argv_size)
        argv[argc++] = start_of_word;

    return argc;
}
void test_split(const char *s)
{
    char buf[1024];
    size_t i, argc;
    char *argv[20];

    strcpy(buf, s);
    argc = split(buf, argv, 20);
    printf("input: '%s'\n", s);
    for (i = 0; i < argc; i++)
        printf("[%zu] '%s'\n", i, argv[i]);
}
int main(int ac, char *av[])
{
    test_split("\"some text in quotes\" plus four simple words p\"lus something strange\"");
    return 0;
}

See program output:

input: '"some text in quotes" plus four simple words p"lus something strange"'
[0] 'some text in quotes'
[1] 'plus'
[2] 'four'
[3] 'simple'
[4] 'words'
[5] 'p"lus'
[6] 'something'
[7] 'strange"'

#3

I wrote a qtok function some time ago that reads quoted words from a string. It's not a state machine and it doesn't make you an array but it's trivial to put the resulting tokens into one. It also handles escaped quotes and trailing and leading spaces:

#include <stdio.h>
#include <ctype.h>
#include <assert.h>

// Strips backslashes from quotes
char *unescapeToken(char *token)
{
    char *in = token;
    char *out = token;

    while (*in)
    {
        assert(in >= out);

        if ((in[0] == '\\') && (in[1] == '"'))
        {
            *out = in[1];
            out++;
            in += 2;
        }
        else
        {
            *out = *in;
            out++;
            in++; 
        }
    }
    *out = 0;
    return token;
}

// Returns the next token, NUL-terminated in place, and stores the resume position in *next.
char *qtok(char *str, char **next)
{
    char *current = str;
    char *start = str;
    int isQuoted = 0;

    // Eat beginning whitespace.
    while (*current && isspace((unsigned char)*current)) current++;
    start = current;

    if (*current == '"')
    {
        isQuoted = 1;
        // Quoted token
        current++; // Skip the beginning quote.
        start = current;
        for (;;)
        {
            // Go till we find a quote or the end of string.
            while (*current && (*current != '"')) current++;
            if (!*current) 
            {
                // Reached the end of the string.
                goto finalize;
            }
            if (*(current - 1) == '\\')
            {
                // Escaped quote keep going.
                current++;
                continue;
            }
            // Reached the ending quote.
            goto finalize; 
        }
    }
    // Not quoted so run till we see a space.
    while (*current && !isspace((unsigned char)*current)) current++;
finalize:
    if (*current)
    {
        // Close token if not closed already.
        *current = 0;
        current++;
        // Eat trailing whitespace.
        while (*current && isspace((unsigned char)*current)) current++;
    }
    *next = current;

    return isQuoted ? unescapeToken(start) : start;
}

int main()
{
    char text[] = "   \"some text in quotes\"    plus   four simple words p\"lus something strange\" \"Then some quoted \\\"words\\\", and backslashes: \\ \\ \"  Escapes only work insi\\\"de q\\\"uoted strings\\\"   ";

    char *pText = text;

    printf("Original: '%s'\n", text);
    while (*pText)
    {
        printf("'%s'\n", qtok(pText, &pText));
    }

}

Outputs:

Original: '   "some text in quotes"    plus   four simple words p"lus something strange" "Then some quoted \"words\", and backslashes: \ \ "  Escapes only work insi\"de q\"uoted strings\"   '
'some text in quotes'
'plus'
'four'
'simple'
'words'
'p"lus'
'something'
'strange"'
'Then some quoted "words", and backslashes: \ \ '
'Escapes'
'only'
'work'
'insi\"de'
'q\"uoted'
'strings\"'

#4

I think the answer to your question is actually fairly simple, but I'm making a different assumption from the one the other answers seem to have made. I'm assuming that you want every quoted block of text to be split out on its own regardless of the spacing around it, with the rest of the text being split on spaces.

So given the example:

"some text in quotes" plus four simple words p"lus something strange"

The output would be:

[0] some text in quotes
[1] plus
[2] four
[3] simple
[4] words
[5] p
[6] lus something strange

Given this, only a small amount of code is needed, and no state machine. First, check whether the first character is a quote; if so, set a flag and skip that character, and also drop any quote at the very end of the string. Next, split the string on quotation marks. The resulting pieces alternate between unquoted and quoted text: the unquoted pieces are the first, third, and so on if there was no leading quote, or the second, fourth, and so on if there was one. Split each unquoted piece again on spaces, and put the resulting words into the output array in place of the piece they came from, leaving each quoted piece as a single entry. This yields the result listed above. In code this would look like:

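#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Append a copy of the n bytes at s to the NULL-terminated list *out. */
static int push(char ***out, size_t *count, const char *s, size_t n)
{
    char **grown = realloc(*out, (*count + 2) * sizeof **out);
    if (grown == NULL)
        return -1;
    *out = grown;
    grown[*count] = malloc(n + 1);
    if (grown[*count] == NULL)
        return -1;
    memcpy(grown[*count], s, n);
    grown[*count][n] = '\0';
    grown[++*count] = NULL;   /* keep the list NULL-terminated */
    return 0;
}

/* Split input on delim (the quote character). Pieces between a pair of
 * delims are kept whole; pieces outside are split again on delim2.
 * Returns a NULL-terminated, malloc'd list of malloc'd strings, or NULL
 * on allocation failure. */
char **parser(const char *input, char delim, char delim2)
{
    char **output = malloc(sizeof *output);
    size_t count = 0;
    int quoted = 0;           /* toggles every time we pass a delim */
    const char *p = input;

    if (output == NULL)
        return NULL;
    output[0] = NULL;

    for (;;) {
        const char *end = strchr(p, delim);
        if (end == NULL)
            end = p + strlen(p);
        if (quoted) {
            /* quoted piece: one entry, spaces and all */
            if (push(&output, &count, p, (size_t)(end - p)) != 0)
                goto fail;
        } else {
            /* unquoted piece: split on delim2, skipping empty words */
            const char *w = p;
            while (w < end) {
                const char *sp = memchr(w, delim2, (size_t)(end - w));
                if (sp == NULL)
                    sp = end;
                if (sp > w && push(&output, &count, w, (size_t)(sp - w)) != 0)
                    goto fail;
                w = (sp < end) ? sp + 1 : end;
            }
        }
        if (*end == '\0')
            break;
        quoted = !quoted;
        p = end + 1;
    }
    return output;

fail:
    while (count > 0)
        free(output[--count]);
    free(output);
    return NULL;
}

int main(void)
{
    const char *input =
        "\"some text in quotes\" plus four simple words p\"lus something strange\"";
    char **result = parser(input, '"', ' ');
    size_t i;

    if (result == NULL)
        return 1;
    for (i = 0; result[i] != NULL; i++) {
        printf("[%zu] %s\n", i, result[i]);
        free(result[i]);
    }
    free(result);
    return 0;
}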

(May not be perfect, I haven't tested it)
