I. String.matches(): filtering out log lines that should not be processed (e.g. whitespace-only lines, blank lines, malformed characters)
Statement:
"!123".matches("[a-zA-Z0-9]{4}")   // false
"34Az".matches("[a-zA-Z0-9]{4}")   // true
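// Application:
// 1. Read the log file in Scala
def readFromTxt(filePath: String): Array[String] = {
  import scala.io.Source
  val source = Source.fromFile(filePath, "UTF-8")
  // Materialize the lines before closing; close the file even if reading fails.
  try source.getLines().toArray
  finally source.close()
}

// 2. Filter the log and extract the fields you need
// Inside a triple-quoted string ("""...""") backslashes need no escaping.
val reg = """([A-Z]+) ([0-9]{4}-[0-9]{1,2}-[0-9]{1,2}) requestURI:(.*)""".r
// Drop the lines that do not match first (blank or malformed lines), then map.
// reg.regex is the pattern string, so the pattern is written only once.
lines.filter(_.matches(reg.regex))
  .map { case reg(level, logdate, addr) => (level, logdate, addr) }
  .foreach(println(_))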
---- Sample log lines ----
INFO 2000-10-01 requestURI:/c?app=0&p=1&did=180042334&industry=45Z
INFO 2012-11-11 requestURI:/c?app=2&p=3&did=140042334&industry=42Z
WARN 2012-11-11 requestURI:/c?app=2&p=3&did=140042334&industry=42Z
ERROR 2012-11-11 requestURI:/c?app=2&p=3&did=140042334&industry=42Z
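As a quick check of the filter-then-extract step, the sketch below runs the same regex over two of the sample lines plus two junk lines. The object name LogFilterDemo, the helper parse and the junk lines are made up for illustration. Note that String.matches succeeds only when the whole line matches the pattern, which is why blank lines are dropped.

```scala
object LogFilterDemo {
  // Same pattern as in the text above: level, date, request URI.
  private val reg = """([A-Z]+) ([0-9]{4}-[0-9]{1,2}-[0-9]{1,2}) requestURI:(.*)""".r

  // Keep only well-formed lines, then pull out the three captured fields.
  def parse(lines: Seq[String]): Seq[(String, String, String)] =
    lines
      .filter(_.matches(reg.regex))
      .map { case reg(level, logdate, addr) => (level, logdate, addr) }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "INFO 2000-10-01 requestURI:/c?app=0&p=1&did=180042334&industry=45Z",
      "",    // blank line (hypothetical junk)
      "   ", // whitespace only (hypothetical junk)
      "WARN 2012-11-11 requestURI:/c?app=2&p=3&did=140042334&industry=42Z"
    )
    parse(lines).foreach(println)
  }
}
```

Only the INFO and WARN lines survive the filter; each is printed as a (level, date, URI) tuple.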
II. Pattern matching with case (recommended; the most convenient approach)
Pattern matching / pattern guards / type matching: https://blog.csdn.net/lyq7269/article/details/107759026
Example 1
// Statement 1:
val pattern = "([a-zA-Z][0-9][a-zA-Z] [0-9][a-zA-Z][0-9])".r
"L3R 6M2" match {
  // x is bound to the first capture group; a pattern may bind several groups
  case pattern(x) => println("Valid zip-code: " + x)
  case x => println("Invalid zip-code: " + x)
}

// Statement 2:
val date = """(\d\d\d\d)-(\d\d)-(\d\d)""".r
"2014-05-23" match {
  case date(year, month, day) => println((year, month, day)) // print the groups as a tuple
}
"2014-05-23" match {
  case date(year, _*) => println("The year of the date is " + year)
}
"2014-05-23" match {
  case date(_*) => println("It is a date")
}
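The linked post also covers pattern guards. A guard can be attached to a regex case so that the regex checks the shape of the input and the guard checks the values. The sketch below is only an illustration: DateGuardDemo, describe and the range checks are hypothetical and are not a full calendar validation.

```scala
object DateGuardDemo {
  private val date = """(\d\d\d\d)-(\d\d)-(\d\d)""".r

  // The regex checks the shape; the guard checks the numeric ranges.
  def describe(s: String): String = s match {
    case date(_, month, day)
        if (1 to 12).contains(month.toInt) && (1 to 31).contains(day.toInt) =>
      s"valid date, month $month"
    case date(_*) => "date-shaped, but month or day is out of range"
    case _        => "not a date"
  }

  def main(args: Array[String]): Unit = {
    println(describe("2014-05-23"))
    println(describe("2014-13-40"))
    println(describe("hello"))
  }
}
```

Cases are tried in order, so the guarded case must come before the catch-all date(_*) case.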
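Example 2
// rdd is assumed to hold the log lines as strings
val reg = """.* set se[0-9]_([0-9]+)_([0-9]+)_([0-9]+)r (.*),.*""".r
rdd.foreach {
  case reg(zs, stu, ques, sa) => println((zs, stu, ques, sa))
  case _ => // skip lines that do not match instead of throwing a MatchError
}
The log line being matched is shown below; the fields to extract (highlighted in red in the original post) are 34434412, 8195023659593, 8080 and 1.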
2019-06-16 14:24:34 INFO com.noriental.praxissvr.answer.util.PraxisSsdbUtil:45 [SimpleAsyncTaskExecutor-1] [020765925160] req: set se0_34434412_8195023659593_8080r 1,resp: ok 14
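Note: pattern matching is convenient, but be careful with nested parentheses in the regex. Every pair of parentheses is a capture group, and a case pattern must bind exactly as many variables as the regex has groups. When matching an integer or a decimal with ([0-9](\.[0-9])?), the inner parentheses add one more group, so a case with four variables no longer matches and, without a fallback case, a MatchError is thrown. Either make the inner group non-capturing, ([0-9](?:\.[0-9])?), or fall back to the simpler (.*).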
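The group-count issue can be reproduced in a few lines. In this sketch (NestedGroupDemo and its two helpers are hypothetical names), the first regex has five capture groups and the second has four, because (?:...) groups without capturing.

```scala
object NestedGroupDemo {
  // Capturing inner group: FIVE groups in total.
  private val capturing =
    """.* set se[0-9]_([0-9]+)_([0-9]+)_([0-9]+)r ([0-9](\.[0-9])?),.*""".r
  // Non-capturing inner group (?:...): FOUR groups in total.
  private val nonCapturing =
    """.* set se[0-9]_([0-9]+)_([0-9]+)_([0-9]+)r ([0-9](?:\.[0-9])?),.*""".r

  // Four variables against five groups: the case never matches.
  def withCapturing(line: String): Option[String] = line match {
    case capturing(_, _, _, sa) => Some(sa)
    case _                      => None
  }

  // Four variables against four groups: matches as intended.
  def withNonCapturing(line: String): Option[String] = line match {
    case nonCapturing(_, _, _, sa) => Some(sa)
    case _                         => None
  }

  def main(args: Array[String]): Unit = {
    val line = "req: set se0_34434412_8195023659593_8080r 1,resp: ok 14"
    println(withCapturing(line))
    println(withNonCapturing(line))
  }
}
```

Binding the extra group explicitly, as in capturing(_, _, _, sa, _), would also work; the non-capturing form just keeps the case pattern in step with the fields you actually want.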
III. The scala.util.matching.Regex API
1) findFirstMatchIn(): returns the first match as an Option[Match]
Statement:
import scala.util.matching.Regex

val numberPattern: Regex = "[0-9]".r
numberPattern.findFirstMatchIn("awesomepassword") match {
  case Some(_) => println("Password OK")                    // a match was found
  case None => println("Password must contain a number")  // no match
}
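Because the result is an Option[Match], it can also be transformed with map instead of a match block. The following sketch (FirstMatchDemo and firstDigit are hypothetical names) returns the position and text of the first digit.

```scala
object FirstMatchDemo {
  private val numberPattern = "[0-9]".r

  // Position and text of the first digit, or None when there is no digit.
  def firstDigit(s: String): Option[(Int, String)] =
    numberPattern.findFirstMatchIn(s).map(m => (m.start, m.matched))

  def main(args: Array[String]): Unit = {
    println(firstDigit("awesomepassword"))
    println(firstDigit("passw0rd"))
  }
}
```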
2) Working with groups
findAllMatchIn() returns an iterator of matches; findAllMatchIn().toList => List[Regex.Match]
Example 1
Statement:
import scala.util.matching.Regex

val studentPattern: Regex = "([0-9a-zA-Z-#() ]+):([0-9a-zA-Z-#() ]+)".r
val input = "name:Jason,age:19,weight:100"
for (patternMatch <- studentPattern.findAllMatchIn(input)) {
  println(s"key: ${patternMatch.group(1)} value: ${patternMatch.group(2)}")
}
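If the pairs are needed later rather than just printed, the matches can be collected into a Map. This is a minimal sketch: KeyValueDemo and toMap are hypothetical names, and the character class is simplified to letters and digits for the illustration.

```scala
import scala.util.matching.Regex

object KeyValueDemo {
  // Simplified key:value pattern (letters and digits only).
  private val pairPattern: Regex = "([0-9a-zA-Z]+):([0-9a-zA-Z]+)".r

  // group(1) is the key, group(2) is the value.
  def toMap(input: String): Map[String, String] =
    pairPattern.findAllMatchIn(input).map(m => m.group(1) -> m.group(2)).toMap

  def main(args: Array[String]): Unit =
    println(toMap("name:Jason,age:19,weight:100"))
}
```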
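Example 2
// Compile the regex once instead of once per line.
// group(n) addresses groups by index, so the nested fifth group is simply not read here.
val reg = """.* set se[0-9]_([0-9]+)_([0-9]+)_([0-9]+)r ([0-9](\.[0-9])?),.*""".r
rdd.map { line =>
  reg.findAllMatchIn(line)
    .map(x => Seq(x.group(1), x.group(2), x.group(3), x.group(4)).mkString("\t"))
    .mkString("")
}.foreach(println(_))
The log line being matched is shown below; the fields to extract (highlighted in red in the original post) are 34434412, 8195023659593, 8080 and 1.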
2019-06-16 14:24:34 INFO com.noriental.praxissvr.answer.util.PraxisSsdbUtil:45 [SimpleAsyncTaskExecutor-1] [020765925160] req: set se0_34434412_8195023659593_8080r 1,resp: ok 14
3) String processing
1. Replacing within a string
replaceFirstIn("target string", "replacement")
replaceAllIn("target string", "replacement")
Statement 1:
"[0-9]+".r.replaceFirstIn("234 Main Street Suite 2034", "567")  // 234 -> 567
"[0-9]+".r.replaceAllIn("234 Main Street Suite 2034", "567")    // 234 and 2034 -> 567
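replaceAllIn also has an overload that takes a function from Match to String, which helps when the replacement depends on what was matched. A small sketch, with ReplaceDemo and incrementNumbers as hypothetical names:

```scala
import scala.util.matching.Regex

object ReplaceDemo {
  private val number: Regex = "[0-9]+".r

  // Replace every number with its value plus one.
  def incrementNumbers(s: String): String =
    number.replaceAllIn(s, m => (m.matched.toInt + 1).toString)

  def main(args: Array[String]): Unit =
    println(incrementNumbers("234 Main Street Suite 2034"))
}
```

In replacement strings `$` and `\` have a special meaning; if the computed text may contain them, wrap it in Regex.quoteReplacement.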
2. Searching within a string
findAllIn().toList => List[String]
In a case pattern, _ discards a group you do not need, and _* discards all remaining groups, so it must come last.
Statement 1:
val nums = "[0-9]+".r.findAllIn("123 Main Street Suite 2012").toList
nums.foreach(println(_))

Statement 2:
val date = """(\d\d\d\d)-(\d\d)-(\d\d)""".r
"2014-05-23" match {
  case date(year, month, day) => println((year, month, day)) // print the groups as a tuple
}
"2014-05-23" match {
  case date(year, _*) => println("The year of the date is " + year)
}
"2014-05-23" match {
  case date(_*) => println("It is a date")
}