I would like to extract a portion of a text using a regular expression. For example, I have an address and want to return just the number and street and exclude the rest:
2222 Main at King Edward Vancouver BC CA
But the address format varies most of the time. I tried using a lookahead regex and came up with this expression:
.*?(?=\w* \w* \w{2}$)
The expression above handles the example nicely, but it gets far too messy as soon as commas appear in the text, or postal codes, which can be a single 6-character string or two 3-character strings with a space in the middle, and so on.
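For reference, here is a minimal Python sketch of what the expression above does. The function name `street_part` and the comma-separated variant of the address are assumptions made for illustration. The sketch suggests the expression returns the street part for the plain example but finds nothing once commas are present.

```python
import re

# The expression from the question, unchanged: a lazy prefix followed by
# a lookahead for "word word XX" at the very end of the string.
LOOKAHEAD = re.compile(r".*?(?=\w* \w* \w{2}$)")

def street_part(text):
    """Return the text before the last three space-separated fields, or None."""
    m = LOOKAHEAD.match(text)
    # The lazy prefix keeps the space before the city, so strip it.
    return m.group(0).strip() if m else None

print(street_part("2222 Main at King Edward Vancouver BC CA"))
# Hypothetical variant with commas: the lookahead only allows single spaces
# between the last three fields, so there is no match.
print(street_part("2222 Main at King Edward Vancouver, BC, CA"))
```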
Is there a more elegant way of extracting a portion of text than a lookahead regex?
Any suggestion or pointer in another direction is greatly appreciated.
Thanks!
3 solutions
#1
Regular expressions are for data that is REGULAR, that follows a pattern. So if your data is completely random, then no, there's no elegant way to do this with regex.
On the other hand, if you know what values you want, you can probably write a few simple regexes, and then just test them all on each string.
For example: regex1 = address number grabber, regex2 = street type grabber, regex3 = name grabber.
Attempt a match on string1 with regex1, regex2, and finally regex3. Then move on to the next string.
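A rough Python sketch of that idea follows. The three patterns, their labels, and the function name `grab_fields` are hypothetical placeholders, not something the answer specifies. Real data would need patterns tuned to the values you actually expect.

```python
import re

# Hypothetical "grabbers"; each one is deliberately simple.
PATTERNS = {
    "number": re.compile(r"^\d+"),
    "street_type": re.compile(r"\b(?:St|Ave|Rd|Blvd|Street|Avenue|Road)\b",
                              re.IGNORECASE),
    "name": re.compile(r"(?<=\d )[A-Za-z]+"),  # first word after the number
}

def grab_fields(text):
    """Try every pattern on one string and record what each one found."""
    found = {}
    for label, pattern in PATTERNS.items():
        m = pattern.search(text)
        found[label] = m.group(0) if m else None
    return found

addresses = [
    "2222 Main at King Edward Vancouver BC CA",
    "555 road and street place CA US 95000",
]
for address in addresses:
    print(grab_fields(address))
```

Keeping each pattern small means a string that defeats one grabber can still yield values from the others.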
#2
Well, I thought I'd throw my hat into the ring:
.*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)
You might want ^ or \d+ at the front for good measure. I didn't bother specifying lengths for the postal codes; this one accepts any number of digits and hyphens.
It works for these inputs so far, and for variations on commas within the city/state/country area:
- 2222 Main at King Edward Vancouver, BC, CA, 333-333
- 555 road and street place CA US 95000
- 2222 Main at King Edward Vancouver BC CA 333
- 555 road and street place CA US
It is counting on there being three words at the end for the city, state and country. Other than that, it's like ryansstack said: if it's random it won't work. If the city is two words, like New York, it won't work. Yeah... regex isn't the tool for this one.
BTW: tested on regexhero.net
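To try the pattern outside an online tester, here is a small Python sketch run against the four inputs listed above. The function name `street_part` is made up for illustration. One assumption is worth flagging: as written, the pattern requires whitespace after each of the three trailing words, so the sketch appends a newline to every input, roughly as if the inputs were lines in a multi-line test box.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r".*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)")

def street_part(line):
    """Return the part of the line before city/state/country, or None."""
    # Assumption: append a newline so the final word is followed by
    # whitespace, which the lookahead needs.
    m = PATTERN.match(line + "\n")
    return m.group(0) if m else None

INPUTS = [
    "2222 Main at King Edward Vancouver, BC, CA, 333-333",
    "555 road and street place CA US 95000",
    "2222 Main at King Edward Vancouver BC CA 333",
    "555 road and street place CA US",
]
for line in INPUTS:
    print(repr(street_part(line)))
```

Note that in the "555" inputs the word "place" is taken as the city, which follows from the three-words-at-the-end assumption.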
#3
I can think of 2 ways you can do this:
1) If you know that "the rest" of your data after the address is exactly 2 fields, i.e. BC and CA, you can split your string using a space as the delimiter and remove the last 2 items.
2) Split on the delimiter /[A-Z][A-Z]/ and store the result in an array, then print out the array (this is provided that the address doesn't contain 2 or more consecutive capital letters).
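Both options can be sketched in Python along these lines. The function names are invented for illustration, and taking the first element of the split result in option 2 is an assumption about which array element is wanted.

```python
import re

def drop_last_fields(text, n=2):
    """Option 1: split on whitespace and drop the last n items (n >= 1)."""
    tokens = text.split()
    return " ".join(tokens[:-n])

def before_first_code(text):
    """Option 2: split on any two consecutive capitals, keep the first piece."""
    pieces = re.split(r"[A-Z][A-Z]", text)
    return pieces[0].strip()

address = "2222 Main at King Edward Vancouver BC CA"
print(drop_last_fields(address))
print(before_first_code(address))
```

For this input both functions keep the city name, since only the two trailing codes are removed.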