Parsing words from a URL string in Python

Date: 2022-02-21 21:38:44

I have a large data set of URLs and I need a way to parse words from the URLs, e.g.:


realestatesales.com -> {"real","estate","sales"}

I would prefer to do it in Python. It seems like this should be possible with some kind of English-language dictionary. There may be some ambiguous cases, but I feel there should be a solution out there somewhere.


3 Solutions

#1


2  

This is a problem of word segmentation, and an efficient dynamic programming solution exists. This page discusses how you could implement it. I have also answered this question on SO before, but I can't find a link to the answer. Please feel free to edit my post if you do.

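The dynamic programming idea can be sketched in a few lines. This is a minimal illustration, not the implementation the answer links to: it takes the first feasible split it finds, whereas a practical segmenter would score candidate splits by word frequency to resolve ambiguity. The sample dictionary here is made up for the example.

```python
def segment(s, dictionary):
    """Split s into dictionary words via dynamic programming.

    dp[i] holds one valid segmentation of s[:i], or None if no
    segmentation of that prefix exists.
    """
    n = len(s)
    dp = [None] * (n + 1)
    dp[0] = []  # the empty prefix is trivially segmented
    for i in range(1, n + 1):
        for j in range(i):
            # extend a segmentable prefix s[:j] by the word s[j:i]
            if dp[j] is not None and s[j:i] in dictionary:
                dp[i] = dp[j] + [s[j:i]]
                break  # take the first feasible split; real systems would score candidates
    return dp[n]

words = {"real", "estate", "sales"}
print(segment("realestatesales", words))  # ['real', 'estate', 'sales']
```

Each prefix is considered once per possible split point, so the runtime is O(n²) dictionary lookups for a string of length n.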

#2


4  

A ternary search tree, when filled with a word dictionary, can find the most complex set of matched terms (words) rather efficiently. This is the solution I've previously used.
You can get a C/Python implementation of a TST here: http://github.com/nlehuen/pytst


Example:


import tst

tree = tst.TST()
# Populate the tree with your dictionary words first (see the pytst docs);
# scan() then walks the string and tst.ListAction() collects each matched
# term into a list.
words = tree.scan("MultipleWordString", tst.ListAction())

Other Resources:


The open-source search engine Solr uses what it calls a "Word-Boundary-Filter" to deal with this problem; you might want to have a look at it.


#3


2  

This might be of use to you: http://www.clips.ua.ac.be/pattern


It's a set of modules which, depending on your system, might already be installed. It does all kinds of interesting things, and even if it doesn't do exactly what you need, it might get you started on the right path.
