Regex to match everything before a trailing slash or the first question mark?

Date: 2020-12-09 22:26:15

I'm trying to come up with a regex that will elegantly match everything in a URL after the domain name and before the first ?, the last slash, or the end of the URL if neither of the two exists.

This is what I came up with but it seems to be failing in some cases:

regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/

In summary:

http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return 2013/07/31/a-new-health-care-approach-dont-hide-the-price

http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return 2013/07/31/a-new-health-care-approach-dont-hide-the-price

http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return 2013/07/31/a-new-health-care-approach-dont-hide-the-price

2 solutions

#1


4  

If lookaheads are allowed

((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)

Copy + Paste this in http://regexpal.com/

See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz

The image used a JavaScript regex, but it works out the same.

(?=...) is just a lookahead.

I basically set up three matches from 2XXX up to (in this order):

(?=\?\w+)  # lookahead for a question mark followed by one or more word characters
(?=/\s+)   # lookahead for a slash         followed by one or more whitespace characters
.*\w       # match up to the last word character
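
As a rough check, here is a minimal Ruby sketch that runs the pattern above against the three URLs from the question. The constant and method names (PATTERN, URLS, extract_path) are made up for illustration. Note that on a single string with no trailing whitespace the second branch (slash followed by whitespace) cannot match, so the trailing-slash URL appears to fall through to the third branch, which stops at the last word character anyway.

```ruby
# The pattern from this answer, written with %r{} so the slash needs no escaping.
PATTERN = %r{((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)}

# The three example URLs from the question.
URLS = [
  "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/",
  "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2",
  "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"
]

# Hypothetical helper: String#[] with a regex returns the matched text,
# or nil when nothing matches.
def extract_path(url)
  url[PATTERN]
end

URLS.each { |url| puts extract_path(url) }
```

Keep in mind this only works because the path happens to start with a four-digit year beginning with 2; a URL without such a prefix returns nil.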

I'm pretty sure that some of the parentheses were not needed, but I just copy-pasted.

There are essentially two OR (|) operators in the (A|B|C) expression. The order matters, since it works like an (if|elseif|else) chain: the first alternative that matches wins.
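
To illustrate why the order matters, here is a small Ruby sketch with two simplified patterns (my own reduction of the regex above, not the original). They contain the same two alternatives in opposite order. The variable names are hypothetical.

```ruby
url = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2"

# Lookahead branch first: the match stops just before "?id=2".
specific_first = /2[0-9]{3}.*(?=\?\w+)|2[0-9]{3}.*\w/

# Catch-all branch first: it succeeds right away and swallows the query string,
# so the lookahead branch is never tried.
catch_all_first = /2[0-9]{3}.*\w|2[0-9]{3}.*(?=\?\w+)/

puts url[specific_first]
puts url[catch_all_first]
```

The first pattern prints the path only; the second one also includes the query string, because the engine takes the leftmost alternative that lets the match succeed.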

You can probably adjust the prefix yourself; I just assumed that you wanted to match 2XXX, where X is a digit.

Also, save the pitchforks, everyone: regular expressions are not always the best tool, but they are there for you when you need them.

Also, there is an xkcd for everything: https://xkcd.com/208/

#2


8  

Please don't use Regex for this. Use the URI library:

require 'uri'
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
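
Note that URI#path keeps the leading slash (and a trailing one, if present), while the question asks for the path without them. A minimal sketch of one way to strip them, assuming that is the desired output; the helper name path_without_slashes is hypothetical:

```ruby
require 'uri'

# Hypothetical helper: returns the URL path without leading or trailing slashes.
def path_without_slashes(url)
  # URI#path already excludes the query string; the gsub only trims
  # slashes at the very start and very end of the path.
  URI(url).path.gsub(%r{\A/+|/+\z}, "")
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"
puts path_without_slashes(base + "/")
puts path_without_slashes(base + "?id=2")
puts path_without_slashes(base)
```

All three calls should print the same path, since the query string is dropped by the URI parser rather than by the regex.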

Why?

See everything about this famous question for a good discussion of why these kinds of things are a bad idea.

Also, this XKCD really says why.

In short, regexes are incredibly powerful tools, but when you're dealing with things built on hundred-page convoluted standards, and there is already a library that does it faster, easier, and more correctly, why reinvent the wheel?
