I am looking to implement a simple forward indexer in PHP. Yes I do understand that PHP is hardly the best tool for the task, but I want to do it anyway. The rationale behind it is simple: I want one, and in PHP.
Let us make a few basic assumptions:
- The entire Interweb consists of about five thousand HTML and/or plain-text documents. Each document resides within a particular domain (UID). No other proprietary/arcane formats exist in our imaginary cavemanesque Interweb.
- The result of our awesome PHP-based forward indexing algorithm should be along the lines of:
UID1 -> index.html -> helen,she,was,champion,with,freckles
UID1 -> foo.html -> chicken,farmers,go,home,eat,sheep
UID2 -> blah.html -> next,week,on,badgerwatch
UID2 -> gah.txt -> one,one,and,one,is,not,numberwang
Ideally, I would love to see solutions that take into account, even at their most elementary, the concepts of tokenization/word boundary disambiguation/part-of-speech-tagging. Of course, I do realise this is wishful thinking, and therefore will humble any worthy attempts at parsing said imaginary documents by:
- Extracting the real textual content stuff within the document as a list of words in the order in which they are presented.
- All the while, ignoring any garbage such as <script> and <html> tags, to compute a list of UIDs (which could be, for instance, a domain) followed by document name (the resource within the domain) and finally the list of words for that document. I do realise that HTML tags play an important role in the semantic placement of text within a document, but at this stage I do not care.
- Bear in mind a solution that can build the list of words WHILE reading the document is cooler than one which needs to read in the whole document first.
At this stage, I do not care about the wheres or hows of storage. Even a rudimentary set of 'print' statements will suffice.
Thanks in advance, hope this was clear enough.
2 solutions
#1
Take a look at
http://simplehtmldom.sourceforge.net/
You do something like
// Assumes simple_html_dom.php has been downloaded next to this script.
include 'simple_html_dom.php';
$p = file_get_html("http://www.page.com");
echo $p->find("body", 0)->plaintext;
And that will give you all the text. Want to iterate over just the links?
foreach ($p->find("a") as $link)
{
echo $link->innertext;
}
It is very useful and powerful. Check it out.
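Once you have the plain text, turning it into one line of the forward index from the question could look roughly like this. It is only a sketch: forwardIndexLine() is a made-up helper name, the sample text is hard-coded in place of the parser's plaintext property, and the split rule (lowercase ASCII letters, digits and hyphens) is an assumption rather than real tokenization.

```php
<?php
// Hypothetical helper: turns already-extracted plain text into one forward-index line.
function forwardIndexLine($uid, $document, $plainText)
{
    // Split on anything that is not a letter, digit or hyphen; drop empty pieces.
    $words = preg_split('/[^a-z0-9-]+/', strtolower($plainText), -1, PREG_SPLIT_NO_EMPTY);
    return $uid . ' -> ' . $document . ' -> ' . implode(',', $words);
}

// In practice $text would come from the parser's plaintext property.
$text = "Next week, on Badgerwatch!";
echo forwardIndexLine('UID2', 'blah.html', $text), "\n";
// prints: UID2 -> blah.html -> next,week,on,badgerwatch
```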
#2
I don't think I'm totally clear on what you're trying to do, but you can get a simple result fairly easily:
- Run the page through Tidy (a good introduction) to make sure it's going to have valid HTML.
- Throw away everything before (and including) <body>.
- Step through the document one character at a time.
  - If the character is a '<', don't do anything with the following characters until you see a '>' (skips HTML)
  - If the character is a "word character" (alphanumeric, hyphen, possibly more) append it to the "current word".
  - If the character is a "non-word character" (punctuation, space, possibly more), add the "current word" to the word list in the forward index, and clear the "current word".
- Do the above until you hit </body>.
That's really about it. You might have to add in some exceptions for handling things like <script> tags (you don't want to consider javascript to be words that should be indexed), but that should give you a basic forward index.
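The steps above can be sketched as a small state machine. This is a rough sketch, not a finished parser: the StreamingTokenizer class and its method names are made up for illustration, it assumes the markup is already valid (e.g. after Tidy), it only treats ASCII letters, digits and hyphens as word characters, and it will be confused by a '>' inside an attribute value or a comment. Because it keeps its state between calls, it can be fed the document in chunks while it is being read.

```php
<?php
// Hypothetical helper (not part of Tidy or any library): a character-by-character
// tokenizer that can be fed a document in chunks.
final class StreamingTokenizer
{
    private $state = 'text';   // 'text', 'tag' or 'raw' (inside <script>/<style>)
    private $inBody = false;   // true between <body> and </body>
    private $tag = '';         // contents of the tag currently being read
    private $closer = '';      // closing tag that ends raw mode, e.g. "</script>"
    private $tail = '';        // last few characters seen in raw mode
    private $word = '';        // the "current word"
    private $words = [];       // word list collected so far

    public function feed($chunk)
    {
        $len = strlen($chunk);
        for ($i = 0; $i < $len; $i++) {
            $c = $chunk[$i];
            if ($this->state === 'raw') {
                // Skip script/style source until its closing tag appears.
                $this->tail = substr($this->tail . strtolower($c), -strlen($this->closer));
                if ($this->tail === $this->closer) {
                    $this->state = 'text';
                }
            } elseif ($this->state === 'tag') {
                if ($c === '>') {
                    $this->endTag();
                } else {
                    $this->tag .= $c;
                }
            } elseif ($c === '<') {
                $this->flushWord();   // a tag also ends the current word
                $this->state = 'tag';
                $this->tag = '';
            } elseif (ctype_alnum($c) || $c === '-') {
                $this->word .= strtolower($c);
            } else {
                $this->flushWord();
            }
        }
    }

    public function finish()
    {
        $this->flushWord();
        return $this->words;
    }

    private function endTag()
    {
        $this->state = 'text';
        $tag = strtolower($this->tag);
        if (preg_match('/^(script|style)\b/', $tag, $m)) {
            $this->state = 'raw';
            $this->closer = '</' . $m[1] . '>';
            $this->tail = '';
        } elseif (preg_match('/^body\b/', $tag)) {
            $this->inBody = true;
        } elseif (preg_match('/^\/body\b/', $tag)) {
            $this->inBody = false;
        }
    }

    private function flushWord()
    {
        // Words are only kept while inside <body>.
        if ($this->word !== '' && $this->inBody) {
            $this->words[] = $this->word;
        }
        $this->word = '';
    }
}

$html = '<html><head><title>Skipped</title><script>var a = 1 < 2;</script></head>'
      . '<body><h1>Helen</h1><p>She was champion, with freckles.</p></body></html>';

$tokenizer = new StreamingTokenizer();
foreach (str_split($html, 16) as $chunk) {   // stand-in for fread() on a file or socket
    $tokenizer->feed($chunk);
}
$words = $tokenizer->finish();
echo implode(',', $words), "\n";   // prints: helen,she,was,champion,with,freckles
```

In real use the str_split() loop would be replaced by a read loop over the file or HTTP stream, so the word list grows while the document is still arriving.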