Count the spaces between words in a string

I have a file that contains:

user_name     order_id     M_Status
jOHN          1000         married to Emma

Each "column" is separated from the next one by 5 spaces, but the number of spaces can differ in another file. Because the words under the M_Status column are separated by a single space, splitting by (" +") didn't work: M_Status needs to stay one string. So what I'm trying to do is count the spaces between the words in the first line, then split all the remaining lines by that number of spaces (5 here, but it could change in another file).

UPDATE:

val delimitersList = List(",", ";", ":", "\\|", "\\t", " ")

def findCommonDelimiter(line: String, sep: Option[String], typeToCheck: String): (List[String], String) = {
  val delimiterMap = scala.collection.mutable.LinkedHashMap[String, Int]()// this needs to be changed to find how many times a delimiter is repeated between two columns
  for (a <- delimitersList)
    delimiterMap += a -> (a + "+").r.findAllIn(line).length

  try {
    val sortedMap = (delimiterMap.toList sortWith ((x, y) => x._2 > y._2)).take(3)
    var splitChar = ""
    val firstDelimiter = sortedMap.head._1.toString
    val firstDelimiterCount = sortedMap.head._2
    val secondDelimiter = sortedMap.drop(1).head._1.toString
    val secondDelimiterCount = sortedMap.drop(1).head._2
    val thirdDelimiter=sortedMap.drop(2).head._1.toString
    val lineSplit=line.split("\\r?\\n")
    if (!firstDelimiter.equalsIgnoreCase(",") &&
       secondDelimiter.equalsIgnoreCase(",") &&
       secondDelimiterCount > 0 &&
       !typeToCheck.equalsIgnoreCase("map")) {//(firstDelimiterCount - commaCount) <= 1 && commaCount > 0) {
      splitChar = ","
    } else if (firstDelimiter.equalsIgnoreCase(" ") || firstDelimiter.equalsIgnoreCase("\\t")) {
      if (lineSplit(0).split(thirdDelimiter, 2).length == 2 &&
         typeToCheck.equalsIgnoreCase("map") &&
         ((secondDelimiter.equalsIgnoreCase(",") &&
         secondDelimiterCount > 0) || (secondDelimiter.equalsIgnoreCase(";") && secondDelimiterCount > 0))) {
        splitChar = thirdDelimiter
      } else if (lineSplit(0).split(secondDelimiter,2).length == 2 && typeToCheck.equalsIgnoreCase("map")) {
        splitChar = secondDelimiter
      } else if (typeToCheck.equalsIgnoreCase("header") && firstDelimiter.equalsIgnoreCase("\\t")) {
        splitChar = "\\t"
      } else if (typeToCheck.equalsIgnoreCase("header") &&
                firstDelimiter.equalsIgnoreCase(" ") &&
                secondDelimiterCount > 0) {
        if ((firstDelimiterCount- secondDelimiterCount >= firstDelimiterCount / 2))
          splitChar = secondDelimiter
      } else {
        if (firstDelimiter.equalsIgnoreCase(" ") &&
           secondDelimiterCount > 0 &&
           (firstDelimiterCount - secondDelimiterCount >= firstDelimiterCount / 2))
          splitChar = secondDelimiter
        else
          splitChar = (sortedMap.maxBy(_._2)._1).toString //.take(1)
      }
    } else
      splitChar = (sortedMap.maxBy(_._2)._1).toString //.take(1)

    if (!splitChar.equalsIgnoreCase("""\|""") && !splitChar.equalsIgnoreCase("\\t")) {
      // println("===>"+splitChar)
      // if(!splitChar.equalsIgnoreCase(""))
      (line.split(splitChar, -1).toList, splitChar)
    } else {
      if (splitChar.equalsIgnoreCase("""\|"""))
        (line.split("\\|", -1).toList, splitChar)
      else
        (line.split("\\t", -1).toList, splitChar)
    }
  } catch {
    case e: Exception => {
      e.printStackTrace()
      (List(line), "")
    }
  }
}

Thanks

4 solutions

#1

I've hacked together some code to get your spaces. Kind of long winded but it works.

Borrowing @Kevin Wright's split function:

def split[T](list: List[T]) : List[List[T]] = list match {
  case Nil => Nil
  case h::t => val segment = list takeWhile {h ==}
    segment :: split(list drop segment.length)
}

You can then do:

scala> val line = "JOHN     1000     married to Emma"
line: String = JOHN     1000     married to Emma

scala> val lengthOfSpaces = split(line.toCharArray.toList).
     | filter(x => x.head.equals(' ') && x.size > 1).
     | map(y => y.length).
     | distinct.head
lengthOfSpaces: Int = 5

scala> line.split(" " * lengthOfSpaces)
res39: Array[String] = Array(JOHN, 1000, married to Emma)

This will also work if you have extra columns:

scala> val line2 = "jOHN     1000     231 any street     married to Emma"
line2: String = jOHN     1000     231 any street     married to Emma

scala> line2.split(" " * lengthOfSpaces)
res47: Array[String] = Array(jOHN, 1000, 231 any street, married to Emma)

I've made the assumption that the spaces between columns will be a uniform value in EACH line. So you can't have 5 spaces between user_name and order_id, and 4 spaces between order_id and the next column.
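
That assumption can be checked before splitting. The following is only a minimal sketch: `uniformGap` is a hypothetical helper (not part of the code above) that returns the gap width only when every run of two or more spaces in a line has the same length.

```scala
// Hypothetical helper: Some(width) if all runs of 2+ spaces share one width, else None.
def uniformGap(line: String): Option[Int] = {
  // Collect the length of every run of two or more spaces, without duplicates.
  val widths = " {2,}".r.findAllIn(line).map(_.length).toList.distinct
  widths match {
    case w :: Nil => Some(w) // exactly one distinct width: the gaps are uniform
    case _        => None    // no gaps at all, or gaps of differing widths
  }
}

val uniform = uniformGap("JOHN     1000     married to Emma")
val mixed   = uniformGap("JOHN     1000    married to Emma") // 5 spaces, then 4
```

Here `uniform` is `Some(5)` and `mixed` is `None`, so a line with mixed gaps can be detected instead of being split incorrectly.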

Also, if you're going to have data with the same number of spaces between columns as between words, perhaps you should normalise your data first. @jan0sch had earlier suggested separating the columns with tabs instead of spaces.

#2

You can split on \\s+ to handle runs of multiple spaces, and pass the limit parameter to cap the number of resulting fields, like:

scala> "jOHN     1000     married to Emma".split("\\s+", 3)
res5: Array[String] = Array(jOHN, 1000, married to Emma)
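
One caveat, sketched below: the limit only protects the last column. `columnCount` is an assumed, known-in-advance value, and the second input is a made-up row with a two-word name.

```scala
val columnCount = 3 // assumption: the number of columns is known in advance

// Works when only the last column contains single spaces.
val ok = "jOHN     1000     married to Emma".split("\\s+", columnCount).toList

// A multi-word value in an earlier column is still split on its single space,
// and everything after the second match ends up in the last field.
val broken = "jOHN Doe     1000     married to Emma".split("\\s+", columnCount).toList
```

`ok` is `List(jOHN, 1000, married to Emma)`, while `broken` is `List(jOHN, Doe, 1000     married to Emma)`.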

#3

I've edited the solution to reflect your actual problem. This is somewhat overkill but will solve it. First we analyse the header line to calculate the spaces. For that we assume that you know the number of columns. Then the rest is simply building the appropriate split parameter.

@ val h = "user_name     order_id     M_Status"
h: String = "user_name     order_id     M_Status"
@ val c = h.count(_ == ' ') / (3 - 1) // total spaces divided by the number of gaps (3 columns, so 2 gaps)
c: Int = 5
@ " jOHN     1000     married to Emma".split(s" {$c}")
res18: Array[String] = Array(" jOHN", "1000", "married to Emma")

Better would be to also calculate the number of columns though...
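
A possible way to do that, as a rough sketch only: `headerGap` is a hypothetical helper, and it assumes the header names themselves contain no spaces and that every gap in the header has the same width.

```scala
// Hypothetical helper: derive the gap width from the header line alone.
def headerGap(header: String): Option[Int] = {
  val h = header.trim
  val columns = h.split("\\s+").length // header names are assumed to contain no spaces
  if (columns < 2) None                // a single column has no gap to measure
  else Some(h.count(_ == ' ') / (columns - 1)) // total spaces / number of gaps
}

val gap = headerGap("user_name     order_id     M_Status")

// Split a data row on exactly that many spaces.
val fields = gap.map(g => "jOHN     1000     married to Emma".split(s" {$g}").toList)
```

For the header above `gap` is `Some(5)`, and `fields` is `Some(List(jOHN, 1000, married to Emma))`.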

#4

You could just use a regex to split the line wherever there are two or more consecutive spaces. No need to count them.

scala> "jOHN Doe     1000     married to Emma".split("""[\s]{2,}""")
res1: Array[String] = Array(jOHN Doe, 1000, married to Emma)
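
Applied to a whole file, this could look roughly like the sketch below. The `lines` list stands in for the file contents from the question, and pairing each row with the header names is an illustrative choice rather than something the question asks for.

```scala
// Stand-in for the lines read from the file in the question.
val lines = List(
  "user_name     order_id     M_Status",
  "jOHN          1000         married to Emma"
)

val splitter = """\s{2,}""" // two or more whitespace characters

// The first line gives the column names.
val header = lines.head.split(splitter).toList

// Every remaining line becomes a Map from column name to value.
val rows = lines.tail.map(l => header.zip(l.split(splitter).toList).toMap)
```

Because the split does not depend on an exact count, it also copes with the aligned row above, where the gaps are wider than the 5 spaces in the header. `rows.head("M_Status")` is `married to Emma`.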
