How do I implement a forward index in PHP?

Time: 2022-02-13 04:18:39

I am looking to implement a simple forward indexer in PHP. Yes, I do understand that PHP is hardly the best tool for the task, but I want to do it anyway. The rationale behind it is simple: I want one, and in PHP.

Let us make a few basic assumptions:

  1. The entire Interweb consists of about five thousand HTML and/or plain-text documents. Each document resides within a particular domain (UID). No other proprietary/arcane formats exist in our imaginary cavemanesque Interweb.

  2. The result of our awesome PHP-based forward indexing algorithm should be along the lines of:

    UID1 -> index.html -> helen,she,was,champion,with,freckles

    UID1 -> foo.html -> chicken,farmers,go,home,eat,sheep

    UID2 -> blah.html -> next,week,on,badgerwatch

    UID2 -> gah.txt -> one,one,and,one,is,not,numberwang

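The output format above maps naturally onto a nested PHP array keyed by UID and then by document name. The following is only a sketch of one possible in-memory layout: the variable `$forwardIndex` and the helper `formatForwardIndex()` are names invented here for illustration, and the data is copied from the example lines.

```php
<?php
// Hypothetical layout: UID => document name => words in document order.
$forwardIndex = [
    'UID1' => [
        'index.html' => ['helen', 'she', 'was', 'champion', 'with', 'freckles'],
        'foo.html'   => ['chicken', 'farmers', 'go', 'home', 'eat', 'sheep'],
    ],
    'UID2' => [
        'blah.html' => ['next', 'week', 'on', 'badgerwatch'],
        'gah.txt'   => ['one', 'one', 'and', 'one', 'is', 'not', 'numberwang'],
    ],
];

// formatForwardIndex() is an illustrative helper, not part of any library.
// It renders one "UID -> document -> word,word,..." line per document.
function formatForwardIndex(array $index): array
{
    $lines = [];
    foreach ($index as $uid => $documents) {
        foreach ($documents as $name => $words) {
            $lines[] = $uid . ' -> ' . $name . ' -> ' . implode(',', $words);
        }
    }
    return $lines;
}

echo implode(PHP_EOL, formatForwardIndex($forwardIndex)), PHP_EOL;
```

Duplicates such as the repeated "one" are kept on purpose, since a forward index records the words in the order they appear.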

Ideally, I would love to see solutions that take into account, even at their most elementary, the concepts of tokenization, word-boundary disambiguation and part-of-speech tagging. Of course, I do realise this is wishful thinking, so I will humbly accept any worthy attempt at parsing said imaginary documents by:

  1. Extracting the real textual content within the document as a list of words, in the order in which they are presented.

  2. All the while ignoring any garbage such as <script> and <html> tags, to compute a list of UIDs (which could be, for instance, a domain), each followed by the document name (the resource within the domain) and finally the list of words for that document. I do realise that HTML tags play an important role in the semantic placement of text within a document, but at this stage I do not care.

  3. Bear in mind that a solution that can build the list of words WHILE reading the document is cooler than one that needs to read in the whole document first.
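To illustrate the last point, here is a minimal sketch of building the word list while the document is still being read. It assumes a hypothetical generator `streamWords()` that reads a stream in small chunks and carries its state (inside a tag or not, the word under construction) across chunk boundaries. It is byte-oriented, so it only handles ASCII letters and digits, and it does not yet skip the contents of <script> elements.

```php
<?php
// streamWords() is a hypothetical helper: it yields each word as soon as it
// is complete, so the whole document never has to be held in memory.
function streamWords($stream, int $chunkSize = 4096): Generator
{
    $inTag = false; // true while we are between '<' and '>'
    $word  = '';    // word under construction; may span two chunks

    while (($chunk = fread($stream, $chunkSize)) !== false && $chunk !== '') {
        $length = strlen($chunk);
        for ($i = 0; $i < $length; $i++) {
            $char = $chunk[$i];
            if ($inTag) {
                if ($char === '>') {
                    $inTag = false; // end of the tag, resume indexing
                }
            } elseif (ctype_alnum($char)) {
                $word .= strtolower($char);
            } else {
                if ($char === '<') {
                    $inTag = true; // start of a tag, which also ends a word
                }
                if ($word !== '') {
                    yield $word;
                    $word = '';
                }
            }
        }
    }
    if ($word !== '') {
        yield $word; // flush the last word of a plain-text document
    }
}

// Usage: an in-memory stream stands in for a real file or socket, and the
// tiny chunk size forces words to straddle chunk boundaries.
$stream = fopen('php://memory', 'w+');
fwrite($stream, '<html><body><p>Helen, she was champion</p> with freckles</body></html>');
rewind($stream);
foreach (streamWords($stream, 8) as $word) {
    echo $word, PHP_EOL;
}
fclose($stream);
```

Because the function is a generator, the caller can print or store each word immediately instead of waiting for the full list.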

At this stage, I do not care about the wheres or hows of storage. Even a rudimentary set of 'print' statements will suffice.

Thanks in advance, hope this was clear enough.

2 solutions

#1


Take a look at

http://simplehtmldom.sourceforge.net/

You do something like

include 'simple_html_dom.php'; // the library file from the site above
$p = new simple_html_dom();
$p->load_file("http://www.page.com/");
echo $p->find("body", 0)->plaintext;

And that will give you all the text. Want to iterate over just the links?

foreach ($p->find("a") as $link)
{
    echo $link->innertext;
}

It is very useful and powerful. Check it out.
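If installing a third-party library is not an option, a similar result can be sketched with PHP's bundled DOM extension. This swaps in a different technique from the one in this answer: `DOMDocument` and `DOMXPath` replace Simple HTML DOM, and `domWords()` is a hypothetical helper name. Each text node is tokenized separately so that words from adjacent elements do not run together.

```php
<?php
// domWords() is a hypothetical helper built on PHP's bundled DOM extension.
function domWords(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Visit each text node under <body>, skipping script and style contents.
    $nodes = $xpath->query('//body//text()[not(ancestor::script or ancestor::style)]');

    $words = [];
    foreach ($nodes as $node) {
        // Split on anything that is not a letter or a digit.
        $parts = preg_split('/[^\p{L}\p{N}]+/u', $node->nodeValue, -1, PREG_SPLIT_NO_EMPTY) ?: [];
        foreach ($parts as $part) {
            $words[] = strtolower($part);
        }
    }
    return $words;
}

$html = '<html><body><p>Chicken farmers</p><a href="#">go home</a>'
      . '<script>var eat = 1;</script></body></html>';
echo implode(',', domWords($html)), PHP_EOL;
```

Note that this approach parses the whole document before producing any words, so it does not meet the "build the list while reading" wish from the question.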

#2


I'm not totally clear on what you're trying to do, but you can get a simple result fairly easily:

  1. Run the page through Tidy (a good introduction) to make sure it's going to have valid HTML.

  2. Throw away everything before (and including) <body>.

  3. Step through the document one character at a time.
    1. If the character is a '<', don't do anything with the following characters until you see a '>' (this skips HTML tags).

    2. If the character is a "word character" (alphanumeric, hyphen, possibly more), append it to the "current word".

    3. If the character is a "non-word character" (punctuation, space, possibly more), add the "current word" to the word list in the forward index, and clear the "current word".

  4. Do the above until you hit </body>.

That's really about it. You might have to add some exceptions for handling things like <script> tags (you don't want JavaScript to be treated as words that should be indexed), but that should give you a basic forward index.
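The steps above could be sketched roughly as follows. `extractWords()` is a hypothetical helper name. The Tidy pass is left out here because the tidy extension is not always installed, and lowercasing the words is an assumption based on the sample output in the question. The sketch treats ASCII letters, digits and hyphens as word characters and would be confused by a '>' inside an attribute value.

```php
<?php
// extractWords() is a hypothetical helper following the steps listed above.
function extractWords(string $html): array
{
    // Throw away everything before (and including) the <body ...> tag.
    if (preg_match('/<body\b[^>]*>/i', $html, $m, PREG_OFFSET_CAPTURE)) {
        $html = substr($html, $m[0][1] + strlen($m[0][0]));
    }
    // Stop at </body>.
    $end = stripos($html, '</body');
    if ($end !== false) {
        $html = substr($html, 0, $end);
    }

    $words  = [];
    $word   = '';
    $length = strlen($html);
    for ($i = 0; $i < $length; $i++) {
        $char = $html[$i];
        if ($char === '<') {
            if ($word !== '') { // a tag also ends the current word
                $words[] = $word;
                $word = '';
            }
            // Exception: jump over the contents of <script> and <style>.
            if (preg_match('/\G<(script|style)\b/i', $html, $m, 0, $i)) {
                $close = stripos($html, '</' . $m[1], $i);
                if ($close === false) {
                    break; // unterminated element: nothing left to index
                }
                $i = $close;
            }
            $gt = strpos($html, '>', $i); // skip to the end of the tag
            if ($gt === false) {
                break;
            }
            $i = $gt;
        } elseif (ctype_alnum($char) || $char === '-') {
            $word .= strtolower($char); // word character
        } elseif ($word !== '') {
            $words[] = $word;           // non-word character ends the word
            $word = '';
        }
    }
    if ($word !== '') {
        $words[] = $word;
    }
    return $words;
}

$html = '<html><head><title>Ignored</title></head><body><h1>Next week</h1>'
      . '<script>var x = "not indexed";</script><p>on Badger-Watch!</p></body></html>';
echo implode(',', extractWords($html)), PHP_EOL;
```

The same loop also works on plain-text documents, since without any '<' characters every byte is simply classified as a word or non-word character.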
