
Time: 2023-01-16 16:40:39

I'm trying to extract salary information from emails and job adverts.


I need a regex that will return the first instance of the range or salary number (I also want to avoid matching phone numbers that might occur later in the string) e.g.


"blah blah £500 blah".match(regex)
=> "£500"
"balh blah £500-650 blah".match(regex)
=> "£500-650"
"£50 per hour".match(regex) 
=> "£50" (or "£50 per hour" for an advanced version)
"blah blah £50k blah".match(regex)
=> "£50k"
"bblah blah 50-60k".match(regex)
=> "50-60k"
"blah blah 50000 blahblahblah".match(regex)  
=> "50000"
"blah 50000 - 60000 blablahblah 0207-123-4567".match(regex) 
=> "50000 - 60000"
#"blah 350 to 425 blah".match(regex) 
#=> "350 to 425" Can forget this last one as it's a bit of an edge case

I've got this far.


/(£|$)?[0-9][0-9]{0,5}[0-9,k]?(-| - | to )?((£|$)?[0-9][0-9]{0,5}[0,5,k]?)?/

Or a slightly enhanced:


/(£|$)?[0-9][0-9]{0,5}[0-9,k]?(-| - | to )?((£|$)?[0-9][0-9][0,5,k]?)? ?(ph|pw|pa| per|per)? ?(hour|annum|week|month)?/

This sort of works, but the result isn't just the whole matched number range: it also lists lots of matches for the individual pieces.



"bblah blah 50-60k".match(regex)
=> #<MatchData "50-60k" 1:nil 2:"-" 3:"60k" 4:nil>

I want it to just say


=> "50-60k"

What am I missing (and is there a more elegant way to do this?)


2 Solutions



I concluded it would be best to do this in two steps. First, let's remove phone numbers:


r0 = /
     \d+        # Match one or more digits
     (?:        # Begin a non-capture group
       \s*\-\s* # Match a hyphen optionally surrounded by spaces
       \d+      # Match one or more digits
     ){2}       # Close non-capture and perform it twice
    /x          # Extended/free-spacing mode

arr0 = ["blah blah £500 blah",
        "balh blah £500-650 blah",
        "£50 per hour",
        "blah blah £50k blah",
        "bblah blah 50-60k",
        "blah blah 50000 blahblahblah",
        "blah 50000 - £60000 blablahblah 0207-123-4567",
        "bblah blah 50k-£60k"]

arr1 = arr0.map { |str| str.gsub(r0,'') }
  #=> ["blah blah £500 blah",
  #    "balh blah £500-650 blah",
  #    "£50 per hour",
  #    "blah blah £50k blah",
  #    "bblah blah 50-60k",
  #    "blah blah 50000 blahblahblah",
  #    "blah 50000 - £60000 blablahblah ",
  #    "bblah blah 50k-£60k"] 

I've assumed that all phone numbers are three strings of digits separated by hyphens that are optionally surrounded with spaces. If that assumption is not correct you would of course have to modify r0 as appropriate.
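To make that assumption concrete, here is a small sketch of what r0 does and does not remove. The helper name `strip_phones` and the sample strings are made up for illustration; the pattern is r0 written on one line.

```ruby
# Same pattern as r0, written on one line without free-spacing mode.
PHONE = /\d+(?:\s*-\s*\d+){2}/

# Hypothetical helper name, not part of the original answer.
def strip_phones(text)
  text.gsub(PHONE, '')
end

[
  "call 0207-123-4567",      # hyphenated: removed
  "call 0207 - 123 - 4567",  # hyphens surrounded by spaces: removed
  "50000 - 60000 a year",    # only two digit groups: kept
  "call 0207 123 4567"       # space-separated: NOT removed by this pattern
].each do |s|
  puts "#{s.inspect} -> #{strip_phones(s).inspect}"
end
```

Anything that looks like three hyphen-separated digit groups is removed, so a date such as 2023-01-16 would be stripped as well.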


Now extract the values desired from the elements of arr1:


r1 = /
     £?         # Optionally begin with a pound sign
     \d+k?      # Match one or more digits optionally followed by k
     (?:        # Begin non-capture group
       \s*\-\s* # Match a hyphen optionally surrounded by spaces
       \d+k?    # Match one or more digits optionally followed by k
     )?         # End non-capture group and make the match optional
     \b         # Word boundary
     /x         # Extended/free-spacing mode

arr1.map { |s| s[r1] }
  #=> ["£500", "£500-650", "£50", "£50k", "50-60k", "50000", "50000", "50k"] 
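The two steps can be wrapped in a single helper. The following is a minimal sketch, assuming the same patterns as r0 and r1 above; the method name `extract_salary` and the constant names are made up for illustration. It also shows why the extra pieces from the question's MatchData do not appear here: `String#[]` returns only the matched text, and `(?:...)` groups are not captured.

```ruby
PHONE_RE  = /\d+(?:\s*-\s*\d+){2}/         # r0, on one line
SALARY_RE = /£?\d+k?(?:\s*-\s*\d+k?)?\b/   # r1, on one line

# Hypothetical helper: first salary-like substring, or nil if none is found.
def extract_salary(text)
  text.gsub(PHONE_RE, '')[SALARY_RE]
end

puts extract_salary("blah 50000 - 60000 blablahblah 0207-123-4567")  # 50000 - 60000
puts extract_salary("bblah blah 50-60k")                             # 50-60k

# (?:...) groups are not captured, so MatchData carries only the whole match.
m = "bblah blah 50-60k".match(SALARY_RE)
puts m[0]        # 50-60k
p m.captures     # []
```

Note that r1 only allows a pound sign before the first number, which is why "50000 - £60000" yields just "50000" in the output above.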



Here is the new update.


((?:£)\d{1,}\-(?:\d{1,}(?:\-|\w))|\d{1,} - \d{1,}|\d{1,} to \d{1,}|(?<=\s)\d{1,}\-\d{1,}(?:k|(?=\s))|(?:£)\d{1,}(?:\w)|(?<=\s)\d{1,}(?=\s))

This should work for all the scenarios listed in the question.


Here is one that also matches the "per hour" suffix:


((?:£)\d{1,}\-(?:\d{1,}(?:\-|\w))|\d{1,} - \d{1,}|\d{1,} to \d{1,}|(?<=\s)\d{1,}\-\d{1,}(?:k|(?=\s))|(?:£)\d{1,}(?:\w| per hour)|(?<=\s)\d{1,}(?=\s))
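A quick way to check a single-pattern approach like this is to run it over the examples from the question. This is only a sketch: the constant `SALARY`, the method `first_salary` and the harness are made up for illustration, and the pattern is the "per hour" variant copied as-is.

```ruby
SALARY = /((?:£)\d{1,}\-(?:\d{1,}(?:\-|\w))|\d{1,} - \d{1,}|\d{1,} to \d{1,}|(?<=\s)\d{1,}\-\d{1,}(?:k|(?=\s))|(?:£)\d{1,}(?:\w| per hour)|(?<=\s)\d{1,}(?=\s))/

# Hypothetical helper: String#[] returns the whole match, or nil.
def first_salary(text)
  text[SALARY]
end

# Inputs and expected results taken from the question.
examples = {
  "blah blah £500 blah"                          => "£500",
  "balh blah £500-650 blah"                      => "£500-650",
  "£50 per hour"                                 => "£50 per hour",
  "blah blah £50k blah"                          => "£50k",
  "bblah blah 50-60k"                            => "50-60k",
  "blah blah 50000 blahblahblah"                 => "50000",
  "blah 50000 - 60000 blablahblah 0207-123-4567" => "50000 - 60000",
  "blah 350 to 425 blah"                         => "350 to 425"
}

examples.each do |input, expected|
  actual = first_salary(input)
  status = actual == expected ? "ok" : "MISMATCH"
  puts "#{status}: #{input.inspect} -> #{actual.inspect}"
end
```

One limitation worth knowing: the `£` branch needs at least two characters after the pound sign, so a single-digit amount such as "£5" is not matched.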


