A really hard string-parsing problem!

22@3@3@文件2 ="数据表0""dd,dd""1.doc",
22@3@3@文件１="数据表表0""dd,dd""1.doc",
25@3@7@附件２=doc,
22@45@8@附件3=""",22@3@7@test",
22@3@9@附件4=""",22@3@7@test""",
2@30@10@附件5="""aaaaaa,22@3@7@test=fffff",
22@3@11@附件6="aaaaaa,22@3@7@"

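The string above needs to be parsed as follows.
  The data structure is: name1=value1,name2=value2. Each pair is separated by a comma. If a value contains a comma or a double quote, the value is wrapped in double quotes and every double quote inside it is doubled, i.e. " -> ""
  The data I want to get is: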
  文件2 =数据表0"dd,dd"1.doc
  文件１=数据表表0"dd,dd"1.doc
  附件4=",22@3@7@test"

......
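
To make the quoting rule concrete, here is a minimal sketch of the encoding direction, i.e. how a raw value would be turned into the form shown in the sample. The class name `FieldQuoting` and the method `quote` are hypothetical names chosen for illustration; they do not come from the original data source.

```java
class FieldQuoting {
    /**
     * Encodes one value following the rule described above (hypothetical helper):
     * values containing a comma or a double quote are wrapped in double quotes,
     * and each double quote inside the value is doubled.
     */
    static String quote(String value) {
        if (value.indexOf(',') < 0 && value.indexOf('"') < 0) {
            return value; // nothing special: emit as-is
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println("文件2 =" + quote("数据表0\"dd,dd\"1.doc"));
        System.out.println("附件２=" + quote("doc"));
        System.out.println("附件4=" + quote("\",22@3@7@test\""));
    }
}
```

Running it reproduces the `name=value` part of the first, third and fifth sample records, so a parser has to undo exactly these two steps: strip the enclosing quotes, then collapse each `""` into `"`.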

13 solutions

#1

#2

What a peculiar string.

#3

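Is this even more complicated than a CSV file?
Do line breaks have to be considered?
If they do, well, that's even more trouble. Writing this as a regular expression would be quite hard.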

#4

Here's a CSV file example for you:
import java.util.*;

/** Parse comma-separated values (CSV), a common Windows file format.
* Sample input: "LU",86.25,"11/4/1998","2:19PM",+4.0625
* <p>
* Inner logic adapted from a C++ original that was
* Copyright (C) 1999 Lucent Technologies
* Excerpted from 'The Practice of Programming'
* by Brian W. Kernighan and Rob Pike.
* <p>
* Included by permission of the http://tpop.awl.com/ web site,
* which says:
* "You may use this code for any purpose, as long as you leave
* the copyright notice and book citation attached." I have done so.
* @author Brian W. Kernighan and Rob Pike (C++ original)
* @author Ian F. Darwin (translation into Java and removal of I/O)
* @author Ben Ballard (rewrote advQuoted to handle '""' and for readability)
*/
public class CSV {

public static final char DEFAULT_SEP = ',';

/** Construct a CSV parser, with the default separator (`,'). */
public CSV() {
this(DEFAULT_SEP);
}

/** Construct a CSV parser with a given separator.
* @param sep The single char for the separator (not a list of
* separator characters)
*/
public CSV(char sep) {
fieldSep = sep;
}

/** The fields in the current String */
protected List list = new ArrayList();

/** the separator char for this parser */
protected char fieldSep;

/** parse: break the input String into fields
* @return a java.util.List containing each field
* from the original as a String, in order.
*/
public List parse(String line)
{
StringBuffer sb = new StringBuffer();
list.clear(); // recycle to initial state
int i = 0;

if (line.length() == 0) {
list.add(line);
return list;
}

do {
            sb.setLength(0);
            if (i < line.length() && line.charAt(i) == '"')
                i = advQuoted(line, sb, ++i); // skip quote
            else
                i = advPlain(line, sb, i);
            list.add(sb.toString());
i++;
} while (i < line.length());

return list;
}

/** advQuoted: quoted field; return index of next separator */
protected int advQuoted(String s, StringBuffer sb, int i)
{
int j;
int len= s.length();
        for (j=i; j<len; j++) {
            if (s.charAt(j) == '"' && j+1 < len) {
                if (s.charAt(j+1) == '"') {
                    j++; // skip escape char
                } else if (s.charAt(j+1) == fieldSep) { //next delimeter
                    j++; // skip end quotes
                    break;
                }
            } else if (s.charAt(j) == '"' && j+1 == len) { // end quotes at end of line
                break; //done
}
sb.append(s.charAt(j)); // regular character.
}
return j;
}

/** advPlain: unquoted field; return index of next separator */
protected int advPlain(String s, StringBuffer sb, int i)
{
int j;

j = s.indexOf(fieldSep, i); // look for separator
        if (j == -1) {                // none found
            sb.append(s.substring(i));
            return s.length();
        } else {
            sb.append(s.substring(i, j));
            return j;
        }
    }
}

import java.util.Iterator;
import java.util.List;

import junit.framework.TestCase;

/**
* @author ian
*/
public class CSVTest extends TestCase {

public static void main(String[] args) {
junit.textui.TestRunner.run(CSVTest.class);
}

CSV csv = new CSV();

String[] data = {
"abc",
"hello, world",
"a,b,c",
"a\"bc,d,e",
"\"a,a\",b,\"c:\\foo\\bar\"",
"\"he\"llo",
"123,456",
"\"L\"\",U\",86.25,\"11/4/1998\",\"2:19PM\",+4.0625",
"bad \"input\",123e01",
//"XYZZY,\"\"|\"OReilly & Associates| Inc."|"Darwin| Ian"|"a \"glug\" bit|"|5|"Memory fault| core NOT dumped"

};
int[] listLength = {
1,
2,
3,
3,
3,
1,
2,
5,
2
};

/** test all the Strings in "data" */
public void testCSV() {
for (int i = 0; i < data.length; i++){
List l = csv.parse(data[i]);
assertEquals(listLength[i], l.size()); // expected value first, then actual
for (int k = 0; k < l.size(); k++){
System.out.print("[" + l.get(k) + "],");
}
System.out.println();
}
}

/** Test one String with a non-default delimiter */
public void testBarDelim() {
// Now test slightly-different string with a non-default separator
CSV parser = new CSV('|');
List l = parser.parse(
"\"LU\"|86.25|\"11/4/1998\"|\"2:19PM\"|+4.0625");
assertEquals(5, l.size()); // expected value first, then actual
Iterator it = l.iterator();
while (it.hasNext()) {
System.out.print("[" + it.next() + "],");
}
System.out.println();
}
}
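
The CSV class above splits a line into fields but knows nothing about the `n@n@n@name=` prefix. As a rough sketch, the same quoted-field scanning idea can be adapted to the `name=value` records from the question. Everything here (the class `PairScanner`, the method `parse`, the constant `SAMPLE`) is a hypothetical name for illustration, and the sketch assumes that every name is preceded by a prefix ending in `@` and that names contain neither `@` nor `=`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PairScanner {
    /** The sample from the question, written on a single line. */
    static final String SAMPLE =
        "22@3@3@文件2 =\"数据表0\"\"dd,dd\"\"1.doc\",22@3@3@文件１=\"数据表表0\"\"dd,dd\"\"1.doc\","
        + "25@3@7@附件２=doc,22@45@8@附件3=\"\"\",22@3@7@test\",22@3@9@附件4=\"\"\",22@3@7@test\"\"\","
        + "2@30@10@附件5=\"\"\"aaaaaa,22@3@7@test=fffff\",22@3@11@附件6=\"aaaaaa,22@3@7@\"";

    /** Parses "n@n@n@name=value,..." into name -> decoded value (a sketch, not a full parser). */
    static Map<String, String> parse(String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        int i = 0;
        int len = input.length();
        while (i < len) {
            int eq = input.indexOf('=', i);
            if (eq < 0) {
                break; // no further name=value pair
            }
            // The name is whatever follows the last '@' before the '='.
            String name = input.substring(i, eq);
            name = name.substring(name.lastIndexOf('@') + 1);
            StringBuilder value = new StringBuilder();
            i = eq + 1;
            if (i < len && input.charAt(i) == '"') {
                i++; // skip the opening quote
                while (i < len) {
                    char c = input.charAt(i);
                    if (c == '"') {
                        if (i + 1 < len && input.charAt(i + 1) == '"') {
                            value.append('"'); // doubled quote -> one literal quote
                            i += 2;
                            continue;
                        }
                        i++; // single quote -> end of the quoted value
                        break;
                    }
                    value.append(c);
                    i++;
                }
            } else {
                // Unquoted value: runs up to the next comma or the end of input.
                while (i < len && input.charAt(i) != ',') {
                    value.append(input.charAt(i++));
                }
            }
            if (i < len && input.charAt(i) == ',') {
                i++; // skip the pair separator
            }
            pairs.put(name, value.toString());
        }
        return pairs;
    }

    public static void main(String[] args) {
        parse(SAMPLE).forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

A character-by-character scan avoids regex backtracking and makes the two states (inside or outside a quoted value) explicit, at the cost of more code than a single pattern.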

#5

No need to consider line breaks. I only wrapped the lines so that everyone could read the data clearly!

#6

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
* @author ly
* @date 2005-12-9
* @version 0.9
*/
public class TestRegular
{
public static void main(String[] args)
{
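// group(3) is the name, group(4) the raw value (quoted or plain).
// The pattern ends with a comma, so a pair is only matched when a comma follows it.
String reg="((\\d)*@)*([^=]*)=(\"([^\"]*((\"\")*[^\"]*(\"\")*)*[^\"]*)\"|([^\"]*((\"\")*[^\"]*(\"\")*)*[^\"]*)),";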
Pattern pattern=Pattern.compile(reg);
String s="22@3@3@文件2 =\"数据表0\"\"dd,dd\"\"1.doc\",22@3@3@文件１=\"数据表表0\"\"dd,dd\"\"1.doc\",25@3@7@附件２=doc,22@45@8@附件3=\"\"\",22@3@7@test\",22@3@9@附件4=\"\"\",22@3@7@test\"\"\",2@30@10@附件5=\"\"\"aaaaaa,22@3@7@test=fffff\",22@3@11@附件6=\"aaaaaa,22@3@7@\"";
Matcher matcher=pattern.matcher(s);
while(matcher.find())
{
System.out.print(matcher.group(3)+"=");
System.out.println(matcher.group(4).replaceAll("\"\"", "'").replaceAll("\"", "").replaceAll("'", "\""));
}
}
}

#7

With line breaks in the input this can be done. Without line breaks I think it is harder.

package com.knightrcom;

import java.io.IOException;

public class TestEnvironment {

    public static void main(String args[]) throws InterruptedException {
        try {
            RegexParser.fileLoad("C:\\ok.txt");
            RegexParser.getAllNeeded();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

package com.knightrcom;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/* Sample Text
22@3@3@文件2 ="数据表0""dd,dd""1.doc",
22@3@3@文件１="数据表表0""dd,dd""1.doc",
25@3@7@附件２=doc,
22@45@8@附件3=""",22@3@7@test",
22@3@9@附件4=""",22@3@7@test""",
2@30@10@附件5="""aaaaaa,22@3@7@test=fffff",
22@3@11@附件6="aaaaaa,22@3@7@"
*/
public class RegexParser {

    private static String content = "";

    public static void fileLoad(String path) throws IOException{
        File file = new File(path);
        FileReader in = new FileReader(file);
        int r;
        System.out.println("File loaded.");
        while ((r = in.read()) != -1){
            content += (char)r;
        }
        in.close();
        System.out.println(content);
        System.out.println();
        System.out.println();
    }

    public static void getAllNeeded(){
        // Each record ends with a comma directly before a line break.
        String[] results = splitCause(content);
        content = "";
        for (String result: results){
            // Swap doubled quotes for a placeholder, drop the enclosing
            // quotes, then turn the placeholder back into a single quote.
            result = result.replace("\"\"", "^_^");
            result = result.replace("\"", "");
            result = result.replace("^_^", "\"");
            // Remove line breaks (Windows or Unix style).
            result = result.replace("\r", "").replace("\n", "");
            // Strip the three leading "number@" prefixes.
            result = result.replaceAll("^(.*?@){3}", "");
            System.out.println(result);
        }
    }

    private static String[] splitCause(String target){
        // Split on a comma that is immediately followed by a line break;
        // in the sample text, commas inside values are never followed by one.
        Pattern regex = Pattern.compile(",(?=\\r?\\n)");
        return regex.split(target);
    }

}

Output

File loaded.
22@3@3@文件2 ="数据表0""dd,dd""1.doc",
22@3@3@文件１="数据表表0""dd,dd""1.doc",
25@3@7@附件２=doc,
22@45@8@附件3=""",22@3@7@test",
22@3@9@附件4=""",22@3@7@test""",
2@30@10@附件5="""aaaaaa,22@3@7@test=fffff",
22@3@11@附件6="aaaaaa,22@3@7@"

文件2 =数据表0"dd,dd"1.doc
文件１=数据表表0"dd,dd"1.doc
附件２=doc
附件3=",22@3@7@test
附件4=",22@3@7@test"
附件5="aaaaaa,22@3@7@test=fffff
附件6=aaaaaa,22@3@7@

#8

Result of running it:

文件2 =数据表0"dd,dd"1.doc
文件１=数据表表0"dd,dd"1.doc
附件２=doc
附件3=",22@3@7@test
附件4=",22@3@7@test"
附件5="aaaaaa,22@3@7@test=fffff

#9

#10

#11

Parsing expression:
\d+@\d+@\d+@([^=]+)=("([^"]|"")*"|[^,]*),?

Test tool (recommended):
http://www.regexlab.com/zh/workshop.asp?pat=%5Cd%2B@%5Cd%2B@%5Cd%2B@%28%5B%5E%3D%5D%2B%29%3D%28%22%28%5B%5E%22%5D%7C%22%22%29*%22%7C%5B%5E%2C%5D*%29%2C%3F&txt=22@3@3@%u6587%u4EF62%20%3D%22%u6570%u636E%u88680%22%22dd%2Cdd%22%221.doc%22%2C%0D%0A22@3@3@%u6587%u4EF6%uFF11%3D%22%u6570%u636E%u8868%u88680%22%22dd%2Cdd%22%221.doc%22%2C%0D%0A25@3@7@%u9644%u4EF6%uFF12%3Ddoc%2C%0D%0A22@45@8@%u9644%u4EF63%3D%22%22%22%2C22@3@7@test%22%2C%0D%0A22@3@9@%u9644%u4EF64%3D%22%22%22%2C22@3@7@test%22%22%22%2C%0D%0A2@30@10@%u9644%u4EF65%3D%22%22%22aaaaaa%2C22@3@7@test%3Dfffff%22%2C%0D%0A22@3@11@%u9644%u4EF66%3D%22aaaaaa%2C22@3@7@%22

#12

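Now I understand what the OP was after in the previous thread.

When I answered the OP's previous thread, I did not know that "" should represent an empty string, so my answer was not entirely correct.
The first and the last quote therefore need special handling, so that "" => (empty string).

The expression is: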
^"|"$|"(")

Replace with:
$1

The code is:
String str = "\"begin\"\"my value\"\"end\"\"AAA\"\"BBB\"";
str = str.replaceAll("^\"|\"$|\"(\")", "$1");
System.out.println(str);

Of course, going by the rule the OP gave, an empty string can be written as =, as well as ="",.
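
As a quick check of the empty-string cases, the sketch below combines the pair expression from post #11 with the replacement expression from this post. The class name `EmptyValueDemo`, the method `parse`, and the three-pair input string are made up for illustration; only the two regular expressions come from the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyValueDemo {
    // Pair pattern from post #11: group 1 is the name, group 2 the raw value.
    static final Pattern PAIR =
        Pattern.compile("\\d+@\\d+@\\d+@([^=]+)=(\"([^\"]|\"\")*\"|[^,]*),?");
    // Unquoting pattern from this post: drops the first and last quote,
    // and replaces each doubled quote with the single quote captured in group 1.
    static final Pattern UNQUOTE = Pattern.compile("^\"|\"$|\"(\")");

    static Map<String, String> parse(String text) {
        Map<String, String> out = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            out.put(m.group(1), UNQUOTE.matcher(m.group(2)).replaceAll("$1"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input: "a" uses ="", "b" uses =, and "c" holds a single quote character.
        Map<String, String> r = parse("1@1@1@a=\"\",1@1@1@b=,1@1@1@c=\"\"\"\"");
        r.forEach((k, v) -> System.out.println(k + "=[" + v + "]"));
    }
}
```

This prints `a=[]`, `b=[]` and `c=["]`: both spellings of the empty string decode to an empty value, while `""""` decodes to one quote character.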

#13

The complete Java code is as follows:

-----------------------------------------------------------
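import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Wrapper class (name chosen here) so that the snippet compiles on its own.
public class ParseDemo {
    public static void main(String[] args) {
        // p1 matches one "n@n@n@name=value" pair: group 1 is the name, group 2 the raw value.
        Pattern p1 = Pattern.compile("\\d+@\\d+@\\d+@([^=]+)=(\"([^\"]|\"\")*\"|[^,]*),?");
        // p2 removes the enclosing quotes and collapses "" into ".
        Pattern p2 = Pattern.compile("^\"|\"$|\"(\")");

        String text =
            "22@3@3@文件2 =\"数据表0\"\"dd,dd\"\"1.doc\",\n" +
            "22@3@3@文件１=\"数据表表0\"\"dd,dd\"\"1.doc\",\n" +
            "25@3@7@附件２=doc,\n" +
            "22@45@8@附件3=\"\"\",22@3@7@test\",\n" +
            "22@3@9@附件4=\"\"\",22@3@7@test\"\"\",\n" +
            "2@30@10@附件5=\"\"\"aaaaaa,22@3@7@test=fffff\",\n" +
            "22@3@11@附件6=\"aaaaaa,22@3@7@\"";

        Matcher m1 = p1.matcher(text);
        while (m1.find()) {
            System.out.print(m1.group(1));
            System.out.print("=");
            System.out.println(p2.matcher(m1.group(2)).replaceAll("$1"));
        }
    }
}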

------------------------------------------------------------
Output:

文件2 =数据表0"dd,dd"1.doc
文件１=数据表表0"dd,dd"1.doc
附件２=doc
附件3=",22@3@7@test
附件4=",22@3@7@test"
附件5="aaaaaa,22@3@7@test=fffff
附件6=aaaaaa,22@3@7@

-----------------------------------------------------------
For more help, I recommend the regular expression reference:
http://www.regexlab.com/zh/regref.htm
