Find the shortest path between two articles in the English Wikipedia

Posted: 2023-02-01 16:13:55

The question:

Find the shortest path between two articles in the English Wikipedia. A path between article A and article B exists if there are articles C(i) such that article A contains a link to article C(1), article C(1) contains a link to article C(2), ..., and article C(n) contains a link to article B.

I'm using Python. URLs for downloading a Wikipedia article:

  1. http://en.wikipedia.org/wiki/Nazwa_artykułu
  2. http://en.wikipedia.org/w/index.php?title=Nazwa_artykułu&printable=yes
  3. Wikipedia API (see the sketch below)
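
For the API option, a minimal sketch of fetching an article's outgoing links through the standard MediaWiki `action=query&prop=links` endpoint might look like this (the `get_links` helper name is just for illustration; pagination is followed via the `continue` token):

import requests

API_URL = 'https://en.wikipedia.org/w/api.php'

def get_links(title):
    # Fetch all article-namespace links on one page via the MediaWiki API,
    # following 'continue' tokens until the result set is exhausted.
    params = {
        'action': 'query',
        'prop': 'links',
        'titles': title,
        'plnamespace': 0,    # namespace 0 = articles only
        'pllimit': 'max',
        'format': 'json',
    }
    links = []
    while True:
        data = requests.get(API_URL, params=params).json()
        for page in data['query']['pages'].values():
            links.extend(link['title'] for link in page.get('links', []))
        if 'continue' not in data:
            break
        params.update(data['continue'])  # resume where the last batch ended
    return links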

I have edited my source code, but it still does not work when I plug those articles into it. Can anyone tell me what I am missing here?

This is my code:

import requests
import xml.etree.ElementTree as ET

# Download the two articles (requests is used here, so the old urllib2
# import is unnecessary, and the downloads must happen before parsing).
F_D = requests.get('http://en.wikipedia.org/wiki/Ant_colony_optimization_algorithms')
P = requests.get('http://en.wikipedia.org/wiki/Wikipedia:Unusual_articles')

# Parse the start article and collect the links from its body text.
# Note: ElementTree requires well-formed XML; raw Wikipedia HTML often
# is not, so a forgiving HTML parser may be needed in practice.
text = ET.fromstring(F_D.text.encode('UTF-8'))
links = text.findall('.//*[@id="mw-content-text"]/p/a')

def D_Algorithm(graph, start, end=None):
    # Dijkstra's algorithm. F_D maps settled vertices to final distances,
    # P maps each vertex to its predecessor on the shortest path, and
    # E_D holds tentative distances for vertices on the frontier.
    F_D = {}
    P = {}
    E_D = {start: 0}
    while E_D:
        # Settle the frontier vertex with the smallest tentative distance.
        vertex = min(E_D, key=E_D.get)
        F_D[vertex] = E_D.pop(vertex)
        if vertex == end:
            break
        for edge in graph[vertex]:
            path_distance = F_D[vertex] + graph[vertex][edge]
            if edge in F_D:
                if path_distance < F_D[edge]:
                    raise ValueError('found a better path to an already-settled vertex')
            elif edge not in E_D or path_distance < E_D[edge]:
                E_D[edge] = path_distance
                P[edge] = vertex
    return (F_D, P)

def Shortest_Path(graph, start, end):
    # Walk the predecessor chain backwards from end to start.
    F_D, P = D_Algorithm(graph, start, end)
    path = []
    while 1:
        path.append(end)
        if end == start:
            break
        end = P[end]
    path.reverse()
    return path
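
As an aside, every Wikipedia link has the same cost, so the link graph is unweighted and Dijkstra's algorithm reduces to breadth-first search. A minimal sketch, assuming a get_links(title) helper like the one sketched above that returns a page's outgoing article links:

import collections

def BFS_Shortest_Path(start, goal, get_links):
    # Breadth-first search over the link graph: the first path that
    # reaches `goal` is shortest, because every edge has cost 1.
    queue = collections.deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        for nxt in get_links(path[-1]):
            if nxt == goal:
                return path + [nxt]
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no path found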

2 answers

#1


Score: 2

We are looking at graph exploration here, so why consider Dijkstra's algorithm at all? IMHO, change the approach.

First, you need a good heuristic function. For every node you expand, you need to estimate the distance from that node to the target/goal node. How you compute that heuristic is the real challenge here. You could perhaps do a keyword mapping between the current wiki page and your destination page; the percentage of matching keywords could give you the estimate. Or try to gauge the relevance of the content of the two pages. I have a hunch a neural network might help you here, but it may not give an optimal estimate either; I'm not sure. Once you figure out a suitable way of doing this, use the A* search algorithm (a sketch follows below).

Search and explore with the heuristic function; do not go for breadth-first search, or you'll get nowhere in the vast, wide world of Wikipedia!

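To illustrate, a rough A* sketch along these lines, assuming the hypothetical get_links(title) helper from above and using keyword overlap between titles as the distance estimate (note: such a heuristic is only a guess, not an admissible bound, so the path found may not be the shortest):

import heapq
import itertools
import re

def keyword_heuristic(title, goal_words):
    # Crude estimate: the fewer words a title shares with the goal title,
    # the farther away we guess it is. A guess, not a lower bound.
    words = set(re.findall(r'\w+', title.lower()))
    return len(goal_words - words)

def A_Star(start, goal, get_links):
    goal_words = set(re.findall(r'\w+', goal.lower()))
    tie = itertools.count()  # tie-breaker so the heap never compares paths
    open_heap = [(keyword_heuristic(start, goal_words), next(tie), 0, start, [start])]
    best_g = {start: 0}   # best known link-distance from start to each page
    while open_heap:
        f, _, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path
        for nxt in get_links(node):
            if g + 1 < best_g.get(nxt, float('inf')):
                best_g[nxt] = g + 1
                f_next = g + 1 + keyword_heuristic(nxt, goal_words)
                heapq.heappush(open_heap, (f_next, next(tie), g + 1, nxt, path + [nxt]))
    return None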

#2


Score: -1

Given the number of articles on Wikipedia, it would take an unaffordable amount of time to compute THE shortest path (my assumption; I haven't tried).

The real problem is to find an acceptable and efficient short path between two articles.

Algorithms that deal with this kind of problem are related to the travelling salesman problem. It could be a good starting point.

IIRC, Google or Yahoo bots use ant colony optimization to get an acceptably short path in reasonable time. You could check this SO question: Where can I learn more about "ant colony" optimizations?

Personally, I'm also fond of the genetic-algorithm approach for finding an acceptable optimum within a certain amount of time.

I have just looked at that image, which puts the number of articles on en.wikipedia.com at 4,000,000 as of 2013. Much less than I thought, indeed.

EDIT: I first stated this was an NP-hard problem, and commenters explained that it's not.
