Extracting Information from Web Pages with Machine Learning

Date: 2021-04-18 07:15:46

I would like to extract a specific type of information from web pages in Python, say postal addresses. They come in thousands of forms, yet are still somehow recognizable. Because there are so many forms, it would probably be very difficult to write a regular expression, or even something like a grammar, and use a parser generator to parse them out.

So I think the way I should go is machine learning. If I understand it correctly, I should be able to make a sample of data where I point out what the result should be, and then have something that can learn from this how to recognize the result by itself. That is all I know about machine learning. Maybe I could use some natural language processing, but probably not much, since most libraries work mostly with English and I need this for Czech.

Questions:

  1. Can I solve this problem easily with machine learning? Is it a good way to go?
  2. Are there any simple examples that would allow me to start? I am a machine learning noob and I need something practical to start with; the closer to my problem, the better; the simpler, the better.
  3. There are plenty of Python libraries for machine learning. Which one would suit my problem best?
  4. Many such libraries have docs that are not very easy to use, as they come from a scientific environment. Are there any good sources (books, articles, quickstarts) bridging the gap, i.e. focused on newbies who know nothing at all about machine learning? Every doc I open starts with terms I don't understand, such as network, classification, datasets, etc.

Update:

As you all mentioned, I should show a sample of the data I am trying to extract from the web, so here is an example. I am interested in cinema showtimes. They look like this (three of them):

<div class="Datum" rel="d_0">27. června – středa, 20.00
</div><input class="Datum_cas" id="2012-06-27" readonly=""><a href="index.php?den=0" rel="0" class="Nazev">Zahájení letního kina 
</a><div style="display: block;" class="ajax_box d-0">
<span class="ajax_box Orig_nazev">zábava • hudba • film • letní bar
</span>
<span class="Tech_info">Svět podle Fagi
</span>
<span class="Popis">Facebooková  komiksová Fagi v podání divadla DNO. Divoké písně, co nezařadíte, ale slušně si na ně zařádíte. Slovní smyčky, co se na nich jde oběsit. Kabaret, improvizace, písně, humor, zběsilost i v srdci.<br>Koncert Tres Quatros Kvintet. Instrumentální muzika s pevným funkovým groovem, jazzovými standardy a neodmyslitelnými improvizacemi.
</span>
<input class="Datum_cas" id="ajax_0" type="text">
</div>

<div class="Datum" rel="d_1">27. června – středa, 21.30
</div><input class="Datum_cas" id="2012-06-27" readonly=""><a href="index.php?den=1" rel="1" class="Nazev">Soul Kitchen
</a><div style="display: block;" class="ajax_box d-1">
<span class="ajax_box Orig_nazev">Soul Kitchen
</span>
<span class="Tech_info">Komedie, Německo, 2009, 99 min., čes. a angl. tit.
</span>
<span class="Rezie">REŽIE: Fatih Akin 
</span>
<span class="Hraji">HRAJÍ: Adam Bousdoukos, Moritz Bleibtreu, Birol Ünel, Wotan Wilke Möhring
</span>
<span class="Popis">Poslední film miláčka publika Fatiho Akina, je turbulentním vyznáním lásky multikulturnímu Hamburku. S humorem zde Akin vykresluje příběh Řeka žijícího v Německu, který z malého bufetu vytvoří originální restauraci, jež se brzy stane oblíbenou hudební scénou. "Soul Kitchen" je skvělá komedie o přátelství, lásce, rozchodu a boji o domov, který je třeba v dnešním nevypočitatelném světě chránit víc než kdykoliv předtím. Zvláštní cena poroty na festivalu v Benátkách
</span>
<input class="Datum_cas" id="ajax_1" type="text">
</div>

<div class="Datum" rel="d_2">28. června – čtvrtek, 21:30
</div><input class="Datum_cas" id="2012-06-28" readonly=""><a href="index.php?den=2" rel="2" class="Nazev">Rodina je základ státu
</a><div style="display: block;" class="ajax_box d-2">
<span class="Tech_info">Drama, Česko, 2011, 103 min.
</span>
<span class="Rezie">REŽIE: Robert Sedláček
</span>
<span class="Hraji">HRAJÍ: Igor Chmela, Eva Vrbková, Martin Finger, Monika A. Fingerová, Simona Babčáková, Jiří Vyorálek, Jan Fišar, Jan Budař, Marek Taclík, Marek Daniel
</span>
<span class="Popis">Když vám hoří půda pod nohama, není nad rodinný výlet. Bývalý učitel dějepisu, který dosáhl vysokého manažerského postu ve významném finančním ústavu, si řadu let spokojeně žije společně se svou rodinou v luxusní vile na okraji Prahy. Bezstarostný život ale netrvá věčně a na povrch začnou vyplouvat machinace s penězi klientů týkající se celého vedení banky. Libor se následně ocitá pod dohledem policejních vyšetřovatelů, kteří mu začnou tvrdě šlapat na paty. Snaží se uniknout před hrozícím vězením a oddálit osvětlení celé situace své nic netušící manželce. Rozhodne se tak pro netradiční útěk, kdy pod záminkou společné dovolené odveze celou rodinu na jižní Moravu…  Rodinný výlet nebo zoufalý úprk před spravedlností? Igor Chmela, Eva Vrbková a Simona Babčáková v rodinném dramatu a neobyčejné road-movie inspirované skutečností.
</span>

Or like this:

<strong>POSEL&nbsp;&nbsp; 18.10.-22.10 v 18:30 </strong><br>Drama. ČR/90´. Režie: Vladimír Michálek Hrají: Matěj Hádek, Eva Leinbergerová, Jiří Vyorávek<br>Třicátník Petr miluje kolo a své vášni podřizuje celý svůj život. Neplánuje, neplatí účty, neřeší nic, co může<br>počkat  do zítra. Budování společného života s přételkyní je mu proti srsti  stejně jako dělat kariéru. Aby mohl jezdit na kole, raději pracuje jako  poslíček. Jeho život je neřízená střela, ve které neplatí žádná  pravidla. Ale problémy se na sebe na kupí a je stále těžší před nimi  ujet …<br> <br>

<strong>VE STÍNU&nbsp; 18.10.-24.10. ve 20:30 a 20.10.-22.10. též v 16:15</strong><br>Krimi. ČR/98´. Režie: D.Vondříček Hrají: I.*, S.Koch, S.Norisová, J.Štěpnička, M.Taclík<br>Kapitán  Hakl (Ivan *) vyšetřuje krádež v klenotnictví. Z běžné vloupačky  se ale vlivem zákulisních intrik tajné policie začíná stávat politická  kauza. Z nařízení Státní bezpečnosti přebírá Haklovo vyšetřování major  Zenke (Sebastian Koch), policejní specialista z NDR, pod jehož vedením  se vyšetřování ubírá jiným směrem, než Haklovi napovídá instinkt  zkušeného kriminalisty. Na vlastní pěst pokračuje ve vyšetřování. Může  jediný spravedlivý obstát v boji s dobře propojenou sítí komunistické  policie?&nbsp; Protivník je silný a Hakl se brzy přesvědčuje, že věřit nelze  nikomu a ničemu. Každý má svůj stín minulosti, své slabé místo, které  dokáže z obětí udělat viníky a z viníků hrdiny. <br><br>

<strong>ASTERIX A OBELIX VE SLUŽBÁCH JEJÍHO VELIČENSTVA&nbsp; ve 3D&nbsp;&nbsp;&nbsp; 20.10.-21.10. ve 13:45 </strong><br>Dobrodružná fantazy. Fr./124´. ČESKÝ DABING. Režie: Laurent Tirard<br>Hrají: Gérard Depardieu, Edouard Baer, Fabrice Luchini<br>Pod  vedením Julia Caesara napadly proslulé římské legie Británii. Jedné  malé vesničce se však daří statečně odolávat, ale každým dnem je slabší a  slabší. Britská královna proto vyslala svého věrného důstojníka  Anticlimaxe, aby vyhledal pomoc u Galů v druhé malinké vesničce ve  Francii vyhlášené svým důmyslným bojem proti Římanům… Když Anticlimax  popsal zoufalou situaci svých lidí, Galové mu darovali barel svého  kouzelného lektvaru a Astérix a Obélix jsou pověřeni doprovodit ho domů.  Jakmile dorazí do Británie, Anticlimax jim představí místní zvyky ve  vší parádě a všichni to pořádně roztočí! Vytočený Caesar se však  rozhodne naverbovat Normanďany, hrůzu nahánějící bojovníky Severu, aby  jednou provždy skoncovali s Brity. <br><br>

Or it can look like anything similar to this: no special rules in the HTML markup, no special rules in the ordering, etc.

8 Answers

#1


40  

First, your task fits into the information extraction (IE) area of research. There are mainly two levels of complexity for this task:

  • Extract from a given HTML page or a website with a fixed template (like Amazon). In this case the best way is to look at the HTML code of the pages and craft the corresponding XPath or DOM selectors to get to the right info. The disadvantage of this approach is that it does not generalize to new websites, since you have to do it for each website one by one.
  • Create a model that extracts the same information from many websites within one domain (assuming there is some inherent regularity in the way web designers present the corresponding attribute, like a zip code or a phone number or whatever else). In this case you should create some features (to use the ML approach and let the IE algorithm "understand the content of pages"). The most common features are: the DOM path, the format of the value (attribute) to be extracted, the layout (bold, italic, etc.), and surrounding context words. You label some values (you need at least 100-300 pages, depending on the domain, to do it with reasonable quality). Then you train a model on the labelled pages. There is also an alternative: doing IE in an unsupervised manner, leveraging the idea of information regularity across pages. In this case your algorithm tries to find repetitive patterns across pages (without labelling) and considers the most frequent ones as valid.

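To make the first, template-based option concrete, here is a minimal stdlib-only sketch (no ML at all) that pulls dates and titles out of markup shaped like the first example above; the class names `Datum` and `Nazev` come from that sample, everything else is illustrative:

```python
from html.parser import HTMLParser

class ShowtimeParser(HTMLParser):
    """Collect the text of elements whose class is 'Datum' or 'Nazev'."""
    def __init__(self):
        super().__init__()
        self._capture = None   # class name of the element we are inside
        self.rows = []         # (class_name, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("Datum", "Nazev"):
            self._capture = cls

    def handle_data(self, data):
        if self._capture and data.strip():
            self.rows.append((self._capture, data.strip()))
            self._capture = None

    def handle_endtag(self, tag):
        self._capture = None

sample = ('<div class="Datum" rel="d_1">27. června – středa, 21.30</div>'
          '<a href="index.php?den=1" rel="1" class="Nazev">Soul Kitchen</a>')
parser = ShowtimeParser()
parser.feed(sample)
print(parser.rows)
# [('Datum', '27. června – středa, 21.30'), ('Nazev', 'Soul Kitchen')]
```

A real page would be fetched and fed to the same parser; the point is that for one fixed template no learning is needed, and it is the second, cross-site option where ML earns its keep.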
The most challenging part overall will be working with the DOM tree and generating the right features. Labelling data the right way is also a tedious task. For ML models, have a look at CRF, 2D CRF, and semi-Markov CRF.

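For a feel of what "generating the right features" means, here is the kind of per-token feature dict a CRF-style tagger consumes. The feature names and the five-digit zip heuristic are illustrative assumptions, not any particular library's API:

```python
def token_features(tokens, i):
    """Features for tokens[i]: its own shape plus surrounding context words."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "looks_like_zip": tok.isdigit() and len(tok) == 5,  # assumed zip shape
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }

features = token_features(["Praha", "11000"], 1)
```

A real feature set would add the DOM path and layout features mentioned above; the dict-per-token shape stays the same.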
And finally, this is in the general case cutting-edge IE research, not a hack you can put together in a few evenings.

P.S. I also think NLTK will not be very helpful here: it is an NLP library, not a Web-IE library.

#2


7  

As for natural language processing, if you're using Python you should absolutely check out the fantastic (IMHO; I am not affiliated with them) Natural Language Toolkit (NLTK), which has implementations of lots of algorithms, many of which are language agnostic (say, n-grams).

For a recommendation of a machine learning library in Python, I'd say it depends on what techniques you want to use, but OpenCV implements some common algorithms. Machine learning is a very vast area. Just for the supervised-learning classification subproblem, there are Naive Bayes, KNN, decision trees, support vector machines, at least a dozen different types of neural networks... the list goes on and on. This is why, as you say, there are no "quickstarts" or tutorials for machine learning in general. My advice here is, firstly, to understand the basic ML terminology; secondly, to understand a subproblem (I'd advise supervised-learning classification); and thirdly, to study a simple algorithm that solves this subproblem (KNN relies on high-school-level math).

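Since KNN is named as the gentle entry point, a complete toy version fits in a few lines. The data points and labels below are invented purely to show the mechanics:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Invented 2D feature vectors with labels:
train = [((0, 0), "noise"), ((0, 1), "noise"),
         ((5, 5), "address"), ((5, 6), "address"), ((6, 5), "address")]
print(knn_classify(train, (5, 5)))   # prints "address"
```

The "high-school-level math" is just the Euclidean distance inside `math.dist`; everything else is counting votes.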
About your problem in particular: it seems you want to detect the existence of a piece of data (a postal code) inside a huge dataset (text), which is, AFAIK, not the type of problem ML handles directly. A classification algorithm expects a relatively small feature vector. To obtain that, you will need to do what's called dimensionality reduction: that is, isolate the parts that look like potential postal codes. Only then does the classification algorithm classify them (as "postal code" or "not postal code", for example).

Thus, you need to find a way to isolate potential matches before you even think about using ML to approach this problem. This will most certainly entail natural language processing, as you said, if you don't or can't use regex or parsing.

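As a sketch of that isolation step: Czech postal codes are five digits, usually written "XXX XX" (an assumption worth verifying), so a cheap regex can generate candidates for a later classifier to accept or reject. The sample text here is invented:

```python
import re

# Candidate generator, not a classifier: high recall with some
# false positives is fine, the classifier filters them later.
CANDIDATE = re.compile(r"\b\d{3}\s?\d{2}\b")

text = "Kino Mír, Dlouhá 12, 110 00 Praha 1, vstupné 120 Kč"
print(CANDIDATE.findall(text))   # ['110 00']
```

Each match, together with its surrounding context, then becomes one small feature vector for the classifier, which is exactly the dimensionality reduction described above.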
#3


7  

Note: there is a Stack Exchange site dedicated to machine learning and statistical analysis called Cross Validated. You are much more likely to find information relevant to your problem there.

Although some programming skills are needed, machine learning as a field is really a speciality of programming and data analysis. It has its own "language" and assumes a basic understanding of matrix operations and linear algebra in general. Much of the work an ML specialist must do involves manipulating their source data into a form the algorithms can work with.

On Cross Validated, you'll find the people that actually solve these kinds of problems on a daily basis, but be prepared to fall deep into the ML rabbit hole. There is much to learn.

https://stats.stackexchange.com/

#4


3  

Firstly, Machine Learning is not magic. These algorithms perform specific tasks, even if these can be a bit complex sometimes.

The basic approach to any such task is to generate some reasonably representative labeled data, so that you can evaluate how well you are doing. "BIO" tags (sometimes written "IOB") could work, where you assign each word a label: "O" (outside) if it is not something you're looking for, "B" (beginning) if it is the start of an address, and "I" (inside) for all subsequent words (or numbers or whatever) in the address.

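A hand-labelled sample under that scheme might look like this (the sentence and the address in it are invented for illustration):

```python
tokens = ["Kino", "najdete", "na", "adrese", "Dlouhá", "12", ",", "110", "00", "Praha"]
labels = ["O",    "O",       "O",  "O",      "B",      "I",  "I", "I",   "I",  "I"]

# Evaluation compares a model's predicted labels against gold labels like
# these; decoding an address back out is just collecting the B/I span:
address = " ".join(t for t, l in zip(tokens, labels) if l != "O")
print(address)   # Dlouhá 12 , 110 00 Praha
```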
The second step is to think about how you want to evaluate your success. Is it important that you discover most of an address, or do you also need to know exactly what each part is (postcode, street, city, etc.)? This then changes what you count as an error.

If you want your named-entity recogniser to work well, you have to know your data well, and decide on the best tool for the job. This may very well be a series of regular expressions with some rules on how to combine the results. I expect you'll be able to find most of the data with relatively simple programmes. Once you have something simple that works, you check out the false positives (things that turned out not to be the thing you were looking for) and the false negatives (things that you missed), and look for patterns. If you see something that you can fix easily, try it out. A huge advantage of regex is that it is much easier to not only recognise something as part of an address, but also detect which part it is.

If you want to move beyond that, you may find that many NLP methods don't perform well on your data, since "Natural Language Processing" usually needs something that looks like (you guessed it) Natural Language to recognise what something is.

Alternatively, since you can view it as a chunking problem, you might use Maximum Entropy Markov Models. This uses probabilities of transitioning from one type of word to another to chunk text into "part of an address" and "not part of an address", in this case.

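A toy version of that chunking idea fits in one screen. Note the hedge: a true MEMM conditions its transition probabilities on observation features, while this sketch uses fixed transitions (closer to an HMM) plus a crude emission score, just to show Viterbi decoding over "A" (part of an address) vs. "O" states. All probabilities here are invented:

```python
import math

STATES = ("A", "O")   # "A" = part of an address, "O" = outside
TRANS = {("A", "A"): 0.8, ("A", "O"): 0.2, ("O", "O"): 0.9, ("O", "A"): 0.1}

def emit(state, tok):
    """Invented emission scores: digits and capitalised words look address-like."""
    addressy = tok[0].isdigit() or tok.istitle()
    return (0.9 if addressy else 0.1) if state == "A" else (0.2 if addressy else 0.8)

def viterbi(tokens):
    # score[s] = best log-probability of any state sequence ending in state s
    score = {s: math.log(0.5) + math.log(emit(s, tokens[0])) for s in STATES}
    back = []
    for tok in tokens[1:]:
        best_prev = {s: max(STATES, key=lambda p: score[p] + math.log(TRANS[(p, s)]))
                     for s in STATES}
        score = {s: score[best_prev[s]] + math.log(TRANS[(best_prev[s], s)])
                    + math.log(emit(s, tok)) for s in STATES}
        back.append(best_prev)
    # Trace the best path backwards through the stored pointers.
    state = max(STATES, key=score.get)
    path = [state]
    for best_prev in reversed(back):
        state = best_prev[state]
        path.append(state)
    return path[::-1]

print(viterbi(["vstupenky", "na", "adrese", "Dlouhá", "12"]))
# ['O', 'O', 'O', 'A', 'A']
```

In a real MEMM the hard-coded `TRANS` and `emit` numbers would be replaced by a trained per-position classifier over features of the current token and previous state.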
Good luck!

#5


2  

I would suggest you look at the field of information extraction. A lot of people have been researching how to do exactly what you're asking. There are some techniques for information extraction that are machine learning based, some techniques that are not machine learning based.

It is hard to comment further without looking at examples representative of the problem you want to solve (how does a postal address look in Czech?).

#6


2  

The approach needs to be a supervised learning algorithm (typically, they yield much better results than unsupervised or semi-supervised methods). Also, notice that you need to basically extract chunks of text. Intuitively, your algorithm needs to say something like, "from this character onward, for the next three lines, is a postal address".

I feel that a natural way to approach this will be a combination of word and character level n-gram language models. The modeling itself can be insanely sophisticated. As pointed out by mcstar, Cross Validated is a better place to get into those details.

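The building block of such language models is plain n-gram counting, from which conditional probabilities like P(next char | history) are then estimated. A character-level sketch (the sample string is invented):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count every character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = char_ngrams("110 00 Praha, 110 00")
print(counts["110"], counts["Pra"])   # 2 1
```

Word-level models work the same way over token sequences; combining both levels, as suggested above, lets the model score address-like character shapes and address-like word contexts at once.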
#7


2  

As far as I know, there are two ways to do this task using a machine learning approach:

1. Use computer vision to train a model and then extract the content based on your use case. This has already been implemented by diffbot.com, but they have not open-sourced their solution.

2. The other way to approach this problem is to use supervised machine learning to train a binary classifier that separates content from boilerplate, and then extract the content. This approach is used in dragnet and other research in this area. You can have a look at a benchmark comparison among different content extraction techniques.

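A crude illustration of the content-vs-boilerplate idea: score each text block by word count and link density. The threshold values and sample strings below are invented; systems like dragnet learn such decision rules from labelled blocks instead of hard-coding them:

```python
def looks_like_content(text, n_links):
    """Toy rule: long, link-sparse blocks are content; short, link-dense ones are chrome."""
    words = text.split()
    link_density = n_links / max(len(words), 1)
    return len(words) >= 10 and link_density < 0.3

print(looks_like_content("Poslední film miláčka publika Fatiho Akina je "
                         "turbulentním vyznáním lásky multikulturnímu Hamburku", 0))  # True
print(looks_like_content("Home | Program | Vstupenky | Kontakt", 4))                  # False
```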
#8


1  

I built a solution for exactly this. My goal was to extract all the information related to competitions available on the internet, and I used a trick: I detected the pattern in which the information is listed on the websites. In my case, the items were listed one after another; I detected that using the HTML table tags and got the information related to the competitions.

While it is a decent solution, it works for some sites, and for others the same code won't work. But you only have to change some parameters in the same code to make it work.
