How do I split a URL string into separate parts in Python?

I decided to learn Python tonight :) I know C pretty well (I wrote an OS in it), so I'm not new to programming and most of Python seems pretty easy, but I don't know how to solve this problem. Let's say I have this address:

http://example.com/random/folder/path.html

Now how can I create two strings from this: one containing the "base" name of the server, which in this example would be http://example.com/, and another containing everything except the last filename, which in this example would be http://example.com/random/folder/? I know I could just find the third and the last slash respectively, but maybe you know a better way :] It would also be nice to have the trailing slash in both cases, but I don't mind since it can be added easily. So does anyone have a good, fast, effective solution for this? Or is there only "my" solution of finding the slashes?

Thanks!

6 solutions

#1

The urlparse module in Python 2.x (or urllib.parse in Python 3.x) is the way to do it.

>>> from urllib.parse import urlparse
>>> url = 'http://example.com/random/folder/path.html'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'example.com'
>>> parse_object.path
'/random/folder/path.html'
>>> parse_object.scheme
'http'
>>>

If you want to do more work on the path portion of the URL, you can use the posixpath module:

>>> from posixpath import basename, dirname
>>> basename(parse_object.path)
'path.html'
>>> dirname(parse_object.path)
'/random/folder'

After that, you can use posixpath.join to glue the parts together.

EDIT: I totally forgot that Windows users will choke on the path separator in os.path. I read the posixpath module docs, and they make special reference to URL manipulation, so all's good.
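
Putting the pieces above together, here is a minimal sketch (Python 3) of how the two strings from the question might be built. The helper name split_url is made up for illustration, and the sketch assumes an absolute URL with a scheme and host; it ignores any query string or fragment.

```python
import posixpath
from urllib.parse import urlparse


def split_url(url):
    """Return (server base, containing folder), both with a trailing slash.

    Hypothetical helper: assumes `url` has a scheme and a host.
    """
    parts = urlparse(url)
    # Server base: scheme + host, e.g. 'http://example.com/'
    base = f"{parts.scheme}://{parts.netloc}/"
    # Folder part of the path, e.g. '/random/folder'
    folder = posixpath.dirname(parts.path)
    # Drop the leading slash so it is not doubled when appended to base
    directory = base + folder.lstrip("/")
    if not directory.endswith("/"):
        directory += "/"
    return base, directory


print(split_url("http://example.com/random/folder/path.html"))
# ('http://example.com/', 'http://example.com/random/folder/')
```

Using posixpath rather than os.path keeps the separator as "/" on every platform, which is the point made in the edit note above.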

#2

I have no experience with Python, but I found the urlparse module, which should do the job.
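
For what it's worth, the same module also provides urljoin (urllib.parse.urljoin in Python 3), which resolves a relative reference against a URL. The sketch below is one possible way to get both strings from the question with it; it is an illustration, not necessarily what this answer had in mind.

```python
from urllib.parse import urljoin

url = "http://example.com/random/folder/path.html"

# Resolving "/" against the URL yields the server root.
base = urljoin(url, "/")
# Resolving "." against the URL yields the containing folder.
folder = urljoin(url, ".")

print(base)    # http://example.com/
print(folder)  # http://example.com/random/folder/
```

Both results keep the trailing slash, which is what the question asked for.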

#3

If this is the extent of your URL parsing, Python's built-in str.rpartition will do the job:

>>> URL = "http://example.com/random/folder/path.html"
>>> Segments = URL.rpartition('/')
>>> Segments[0]
'http://example.com/random/folder'
>>> Segments[2]
'path.html'

From Pydoc, str.rpartition:

Splits the string at the last occurrence of sep, and returns a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing two empty strings, followed by the string itself

What this means is that rpartition does the searching for you, and splits the string at the last (right most) occurrence of the character you specify (in this case / ). It returns a tuple containing:

(everything to the left of char , the character itself , everything to the right of char)
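
One caveat worth seeing in code: rpartition knows nothing about URL structure, so a URL without any path splits inside the scheme separator. A small sketch, where the helper name parent_and_name is made up for illustration:

```python
def parent_and_name(url):
    """Split at the last '/' and keep the slash on the left-hand part.

    Hypothetical helper: plain string splitting, no URL awareness.
    """
    head, sep, tail = url.rpartition("/")
    return head + sep, tail


print(parent_and_name("http://example.com/random/folder/path.html"))
# ('http://example.com/random/folder/', 'path.html')

# With no path at all, the host name is mistaken for the file name:
print(parent_and_name("http://example.com"))
# ('http://', 'example.com')
```

So this approach is fine when every input is known to contain a file name, and a real URL parser is the safer choice otherwise.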

#4

In Python a lot of operations are done using lists. The urlparse module mentioned by Sebasian Dietz may well solve your specific problem, but if you're generally interested in Pythonic ways to find slashes in strings, for example, try something like this:

url = 'http://example.com/random/folder/path.html'
# Create a list of each bit between slashes
slashparts = url.split('/')
# Now join back the first three sections 'http:', '' and 'example.com'
basename = '/'.join(slashparts[:3]) + '/'
# All except the last one
dirname = '/'.join(slashparts[:-1]) + '/'
# print() with a single argument behaves the same in Python 2 and 3
print('slashparts = %s' % slashparts)
print('basename = %s' % basename)
print('dirname = %s' % dirname)

The output of this program is this:

slashparts = ['http:', '', 'example.com', 'random', 'folder', 'path.html']
basename = http://example.com/
dirname = http://example.com/random/folder/

The interesting bits are split, join, the slice notation array[A:B] (including negatives for offsets-from-the-end) and, as a bonus, the % operator on strings to give printf-style formatting.

#5

Thank you very much to the other answerers here, who pointed me in the right direction via the answers they have given!

It seems like the posixpath module mentioned by sykora's answer is not available in my Python setup (python 2.7.3).

As per this article it seems that the "proper" way to do this would be using...

urlparse.urlparse and urlparse.urlunparse can be used to detach and reattach the base of the URL

The functions of os.path can be used to manipulate the path

urllib.url2pathname and urllib.pathname2url (to make path name manipulation portable, so it can work on Windows and the like)

So for example (not including reattaching the base URL)...

>>> import urlparse, urllib, os.path
>>> os.path.dirname(urllib.url2pathname(urlparse.urlparse("http://example.com/random/folder/path.html").path))
'/random/folder'
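
The example above leaves out reattaching the base URL. A rough Python 3 sketch of that step could look like the following. Note the assumptions: it uses urllib.parse instead of the Python 2 urlparse module, it swaps in posixpath.dirname for the os.path/url2pathname combination so the result does not depend on the platform, and the helper name parent_url is made up.

```python
import posixpath
from urllib.parse import urlparse, urlunparse


def parent_url(url):
    """Return the URL of the folder containing `url`, with a trailing slash.

    Hypothetical helper: drops params, query string and fragment.
    """
    parts = urlparse(url)
    folder = posixpath.dirname(parts.path)
    if not folder.endswith("/"):
        folder += "/"
    # ParseResult is a namedtuple, so _replace builds a modified copy
    # that urlunparse can turn back into a string.
    return urlunparse(parts._replace(path=folder, params="", query="", fragment=""))


print(parent_url("http://example.com/random/folder/path.html?x=1#top"))
# http://example.com/random/folder/
```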

#6

You can use the third-party furl library:

import furl  # third-party package, not part of the standard library

f = furl.furl("http://example.com/random/folder/path.html")
print(str(f.path))  # '/random/folder/path.html'
print(str(f.path).split("/"))  # ['', 'random', 'folder', 'path.html']

To access the path segment after the first "/", use:

str(f.path).split("/")[1]  # 'random'
