
Time: 2020-11-27 21:36:34

I have a file that contains:


user_name     order_id     M_Status
jOHN          1000         married to Emma

Each "column" is separated from the next one by 5 spaces, but the number of spaces can differ in another file. Because the words under the M_Status column are separated by a single space, splitting on (" +") didn't work: the M_Status value needs to stay one string. So what I'm trying to do is count the spaces between the words in the first line, then split all the remaining lines by that number of spaces (5 here, but it could change in another file).



val delimitersList = List(",", ";", ":", "\\|", "\\t", " ")

def findCommonDelimiter(line: String, sep: Option[String], typeToCheck: String): (List[String], String) = {
  val delimiterMap = scala.collection.mutable.LinkedHashMap[String, Int]()// this needs to be changed to find how many times a delimiter is repeated between two columns
  for (a <- delimitersList)
    delimiterMap += a -> (a + "+").r.findAllIn(line).length

  try {
    val sortedMap = (delimiterMap.toList sortWith ((x, y) => x._2 > y._2)).take(3)
    var splitChar = ""
    val firstDelimiter = sortedMap.head._1.toString
    val firstDelimiterCount = sortedMap.head._2
    val secondDelimiter = sortedMap.drop(1).head._1.toString
    val secondDelimiterCount = sortedMap.drop(1).head._2
    val thirdDelimiter=sortedMap.drop(2).head._1.toString
    val lineSplit=line.split("\\r?\\n")
    if (!firstDelimiter.equalsIgnoreCase(",") &&
       secondDelimiter.equalsIgnoreCase(",") &&
       secondDelimiterCount > 0 &&
       !typeToCheck.equalsIgnoreCase("map")) { //(firstDelimiterCount - commaCount) <= 1 && commaCount > 0) {
      splitChar = ","
    } else if (firstDelimiter.equalsIgnoreCase(" ") || firstDelimiter.equalsIgnoreCase("\\t")) {
      if (lineSplit(0).split(thirdDelimiter, 2).length == 2 &&
         typeToCheck.equalsIgnoreCase("map") &&
         ((secondDelimiter.equalsIgnoreCase(",") &&
         secondDelimiterCount > 0) || (secondDelimiter.equalsIgnoreCase(";") && secondDelimiterCount > 0))) {
        splitChar = thirdDelimiter
      } else if (lineSplit(0).split(secondDelimiter,2).length == 2 && typeToCheck.equalsIgnoreCase("map")) {
        splitChar = secondDelimiter
      } else if (typeToCheck.equalsIgnoreCase("header") && firstDelimiter.equalsIgnoreCase("\\t")) {
        splitChar = "\\t"
      } else if (typeToCheck.equalsIgnoreCase("header") &&
                firstDelimiter.equalsIgnoreCase(" ") &&
                secondDelimiterCount > 0) {
        if ((firstDelimiterCount- secondDelimiterCount >= firstDelimiterCount / 2))
          splitChar = secondDelimiter
      } else {
        if (firstDelimiter.equalsIgnoreCase(" ") &&
           secondDelimiterCount > 0 &&
           (firstDelimiterCount - secondDelimiterCount >= firstDelimiterCount / 2))
          splitChar = secondDelimiter
        else
          splitChar = (sortedMap.maxBy(_._2)._1).toString //.take(1)
      }
    } else
      splitChar = (sortedMap.maxBy(_._2)._1).toString //.take(1)

    if (!splitChar.equalsIgnoreCase("""\|""") && !splitChar.equalsIgnoreCase("\\t")) {
      // println("===>"+splitChar)
      // if(!splitChar.equalsIgnoreCase(""))
      (line.split(splitChar, -1).toList, splitChar)
    } else {
      if (splitChar.equalsIgnoreCase("""\|"""))
        (line.split("\\|", -1).toList, splitChar)
      else
        (line.split("\\t", -1).toList, splitChar)
    }
  } catch {
    // Fall back to the unsplit line if anything above fails.
    case e: Exception => {
      (List(line), "")
    }
  }
}


4 Answers



I've hacked together some code to count your spaces. It's a bit long-winded, but it works.


Borrowing @Kevin Wright's split function from here:


def split[T](list: List[T]): List[List[T]] = list match {
  case Nil => Nil
  case h :: _ =>
    // Take the leading run of elements equal to the head, then recurse on the rest.
    val segment = list.takeWhile(_ == h)
    segment :: split(list.drop(segment.length))
}

Then you can do:


scala> val line = "JOHN     1000     married to Emma"
line: String = JOHN     1000     married to Emma

scala> val lengthOfSpaces = split(line.toCharArray.toList).
     | filter(x => x.head.equals(' ') && x.size > 1).
     | map(y => y.length).
     | distinct.head
lengthOfSpaces: Int = 5

scala> line.split(" " * lengthOfSpaces)
res39: Array[String] = Array(JOHN, 1000, married to Emma)

It also works if you have extra columns:


scala> val line2 = "jOHN     1000     231 any street     married to Emma"
line2: String = jOHN     1000     231 any street     married to Emma

scala> line2.split(" " * lengthOfSpaces)
res47: Array[String] = Array(jOHN, 1000, 231 any street, married to Emma)

I've assumed that the gap between columns has the same width everywhere within a line. So you can't have 5 spaces between user_name and order_id and 4 spaces between order_id and the next column.


Also, if your data is going to have the same number of spaces between columns as between words, perhaps you should normalise your data first. @jan0sch had earlier suggested separating the columns with tabs instead.
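
A shorter way to get the same run lengths, sketched here with a regex instead of the recursive split. The names `GapRuns`, `gapWidths` and `splitOnGap` are made up for this example. It returns every distinct gap width it finds, so a line that breaks the uniform-width assumption can be detected rather than silently mis-split.

```scala
object GapRuns {
  // Lengths of every run of two or more spaces in the line, de-duplicated.
  def gapWidths(line: String): List[Int] =
    " {2,}".r.findAllIn(line).map(_.length).toList.distinct

  // Split on the gap only when exactly one width occurs; otherwise return None.
  def splitOnGap(line: String): Option[List[String]] =
    gapWidths(line) match {
      case width :: Nil => Some(line.split(" " * width).toList)
      case _            => None
    }
}
```

Returning an Option is only one possible design; it leaves the decision about how to handle an irregular line to the caller.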




You can split on runs of whitespace with \\s+ and pass the limit parameter to cap the number of resulting pieces, like:


scala> "jOHN     1000     married to Emma".split("\\s+", 3)
res5: Array[String] = Array(jOHN, 1000, married to Emma)
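
One caveat, shown as a small sketch: `\\s+` also matches the single spaces inside a value, so the limit only protects the last column. If an earlier column contains a space (the `jOHN Doe` value below is an assumed example, and `LimitCaveat` is just a name for this illustration), the fields shift.

```scala
object LimitCaveat {
  // "\\s+" with limit 3 performs at most two splits, starting from the left.
  val ok: List[String] =
    "jOHN     1000     married to Emma".split("\\s+", 3).toList

  // The first split now falls inside "jOHN Doe", so the columns are misaligned
  // and the unsplit remainder ends up in the third element.
  val shifted: List[String] =
    "jOHN Doe     1000     married to Emma".split("\\s+", 3).toList
}
```

So this approach seems best suited to files where only the final column can contain spaces and the number of columns is known up front.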



I've edited the solution to reflect your actual problem. This is somewhat overkill but will solve it. First we analyse the header line to calculate the spaces. For that we assume that you know the number of columns. Then the rest is simply building the appropriate split parameter.


@ val h = "user_name     order_id     M_Status"
h: String = "user_name     order_id     M_Status"
@ val c = (h.split("\\s+").fold("")(_ ++ _).length - h.foldLeft(0)((a, b) => if (b == ' ') a + 1 else a)) / 3
c: Int = 5
@ " jOHN     1000     married to Emma".split(s" {$c}")
res18: Array[String] = Array(" jOHN", "1000", "married to Emma")

It would be better to also calculate the number of columns, though...
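
The column count can be derived from the header as well, assuming header names never contain spaces. Note that the expression above gives 5 for this header, but it divides the difference between the non-space and space character counts by the column count, so it may not carry over to other headers. A minimal sketch of a more direct calculation (the names `HeaderGap`, `gapWidth` and `splitRow` are illustrative, not from the answer): count the spaces in the header and divide by the number of gaps, which is one less than the number of columns.

```scala
object HeaderGap {
  // Spaces per gap = total spaces / (columns - 1). Assumes uniform gaps
  // and header names without embedded spaces.
  def gapWidth(header: String): Int = {
    val trimmed = header.trim
    val columns = trimmed.split("\\s+").length
    require(columns > 1, "need at least two columns to measure a gap")
    trimmed.count(_ == ' ') / (columns - 1)
  }

  // Split a data row on exactly `width` consecutive spaces.
  def splitRow(row: String, width: Int): List[String] =
    row.trim.split(s" {$width}").toList
}
```

For the header in the question this yields 10 spaces over 2 gaps, i.e. a width of 5, which can then be passed to `splitRow` for every remaining line.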




You could just use a regex to split the line wherever there are two or more whitespace characters in a row. There's no need to count them.


scala> "jOHN Doe     1000     married to Emma".split("""[\s]{2,}""")
res1: Array[String] = Array(jOHN Doe, 1000, married to Emma)
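
To apply this to a whole file, one option is to split the header and each data line with the same pattern and pair them up. The sketch below assumes the file content is already in memory as a string and that no value contains two or more consecutive spaces; `SpaceTable` and `parseTable` are hypothetical names. Because the pattern matches any run of two or more whitespace characters, it does not depend on the gap being exactly 5 wide, so it also copes with rows that are padded to line up under the header.

```scala
object SpaceTable {
  // Two or more whitespace characters mark a column boundary.
  private val gap = """\s{2,}"""

  // Returns one Map per data line, keyed by the header names.
  def parseTable(text: String): List[Map[String, String]] = {
    val lines = text.split("\\r?\\n").toList.map(_.trim).filter(_.nonEmpty)
    lines match {
      case header :: rows =>
        val names = header.split(gap).toList
        // zip drops surplus fields if a row has more or fewer than the header.
        rows.map(row => names.zip(row.split(gap).toList).toMap)
      case Nil => Nil
    }
  }
}
```

A Map per row is just one convenient shape; returning `List[List[String]]` would work equally well if the header names are not needed.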


