I'm trying to download a static mirror of a wiki using wget. I only want the latest version of each article (not the full history or diffs between versions). It would be easy to just download the whole thing and delete unnecessary pages later, but doing so would take too much time and place an unnecessary strain on the server.
There are a number of pages I clearly don't need such as:
WhoIsDoingWhat?action=diff&date=1184177979
Is there a way to tell wget not to download and recurse on URLs that have 'action=diff' in them? Or otherwise exclude URLs that match some regex?
1 Answer
#1
Use wget's `-R` (`--reject`) option, which takes a comma-separated list of wildcard patterns; URLs whose file-name portion matches a pattern are neither saved nor recursed into:

-R '*action=diff*,*action=edit*'
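A minimal sketch of a full invocation, assuming the wiki lives at the placeholder URL `http://wiki.example.com/` (swap in your own site and adjust the mirroring flags to taste):

```shell
# Mirror the wiki, skipping diff and edit pages.
# -R patterns are shell-style globs matched against the file name,
# including the query string, so 'action=diff' pages are rejected
# before they are downloaded or recursed into.
wget --mirror --no-parent --convert-links \
     -R '*action=diff*,*action=edit*' \
     'http://wiki.example.com/'
```

If your wget is 1.14 or newer, `--reject-regex 'action=(diff|edit)'` is an alternative that matches a POSIX regex against the complete URL, which directly answers the "exclude URLs that match some regex" part of the question.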