Get a list of URLs from a site

Time: 2022-11-11 22:58:52

I'm deploying a replacement site for a client but they don't want all their old pages to end in 404s. Keeping the old URL structure wasn't possible because it was hideous.

So I'm writing a 404 handler that should look for an old page being requested and do a permanent redirect to the new page. Problem is, I need a list of all the old page URLs.

I could do this manually, but I'd be interested to know if there are any apps that would give me a list of relative URLs (e.g. /page/path, not http:/.../page/path) given just the home page. Like a spider, but one that doesn't care about the content other than to find deeper pages.

8 answers

#1


I didn't mean to answer my own question, but I just thought of running a sitemap generator. The first one I found, http://www.xml-sitemaps.com, has a nice text output. Perfect for my needs.
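
A text export like that usually lists absolute URLs, while the question asks for relative ones. A minimal Python sketch of the conversion, assuming one absolute URL per line; the sample URLs and the function names are made up for illustration:

```python
from urllib.parse import urlsplit


def to_relative(url):
    """Return the path (plus query string, if any) of an absolute URL."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")


def relative_urls(lines):
    """De-duplicate and sort the relative form of every non-blank line."""
    return sorted({to_relative(line) for line in lines if line.strip()})


if __name__ == "__main__":
    # Hypothetical sample standing in for the generator's text export.
    sample = [
        "http://www.oldsite.com/",
        "http://www.oldsite.com/page/path",
        "http://www.oldsite.com/page.asp?id=3",
    ]
    for path in relative_urls(sample):
        print(path)
```

The query string is kept because old dynamic pages are often told apart only by their parameters.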

#2


Run wget -r -l0 www.oldsite.com

Then running find www.oldsite.com on the downloaded copy should reveal all the URLs, I believe.
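
If find isn't handy, the same listing can be sketched in Python. This assumes the mirror was saved under a directory named www.oldsite.com in the current folder (what wget -r does by default); the function name is hypothetical. Note that wget stores directory indexes as index.html, so "/" shows up as /index.html.

```python
import os


def mirrored_paths(mirror_root):
    """Yield a site-relative URL path for every file under mirror_root."""
    for dirpath, _dirnames, filenames in os.walk(mirror_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, mirror_root)
            # Use forward slashes regardless of the local path separator.
            yield "/" + rel.replace(os.sep, "/")


if __name__ == "__main__":
    # Assumed location of the mirror created by the wget command above.
    for path in sorted(mirrored_paths("www.oldsite.com")):
        print(path)
```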

Alternatively, just serve a custom not-found page on every 404 request! I.e. if someone follows a wrong link, they get a page saying that the page wasn't found, with some hints about the site's content.

#3


Here is a list of sitemap generators (from which obviously you can get the list of URLs from a site): http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators

Web Sitemap Generators

The following are links to tools that generate or maintain files in the XML Sitemaps format, an open standard defined on sitemaps.org and supported by search engines such as Ask, Google, Microsoft Live Search and Yahoo!. Sitemap files generally contain a collection of URLs on a website along with some meta-data for these URLs. The following tools generally generate "web-type" XML Sitemap and URL-list files (some may also support other formats).
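
For the purpose of this question, the URL list can be pulled back out of such a file. A minimal sketch using Python's standard library, assuming a plain urlset document in the sitemaps.org 0.9 namespace; the sample content and the function name are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org urlset documents.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS) if loc.text]


# Hypothetical sitemap with two entries, one carrying extra meta-data.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.oldsite.com/</loc></url>
  <url>
    <loc>http://www.oldsite.com/page/path</loc>
    <lastmod>2008-01-01</lastmod>
  </url>
</urlset>"""

if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE):
        print(url)
```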

Please Note: Google has not tested or verified the features or security of the third party software listed on this site. Please direct any questions regarding the software to the software's author. We hope you enjoy these tools!

Server-side Programs

  • Enarion phpSitemapsNG (PHP)
  • Google Sitemap Generator (Linux/Windows, 32/64bit, open-source)
  • Outil en PHP (French, PHP)
  • Perl Sitemap Generator (Perl)
  • Python Sitemap Generator (Python)
  • Simple Sitemaps (PHP)
  • SiteMap XML Dynamic Sitemap Generator (PHP) $
  • Sitemap generator for OS/2 (REXX-script)
  • XML Sitemap Generator (PHP) $

CMS and Other Plugins:

  • ASP.NET - Sitemaps.Net
  • DotClear (Spanish)
  • DotClear (2)
  • Drupal
  • ECommerce Templates (PHP) $
  • Ecommerce Templates (PHP or ASP) $
  • LifeType
  • MediaWiki Sitemap generator
  • mnoGoSearch
  • OS Commerce
  • phpWebSite
  • Plone
  • RapidWeaver
  • Textpattern
  • vBulletin
  • Wikka Wiki (PHP)
  • WordPress

Downloadable Tools

  • GSiteCrawler (Windows)
  • GWebCrawler & Sitemap Creator (Windows)
  • G-Mapper (Windows)
  • Inspyder Sitemap Creator (Windows) $
  • IntelliMapper (Windows) $
  • Microsys A1 Sitemap Generator (Windows) $
  • Rage Google Sitemap Automator $ (OS-X)
  • Screaming Frog SEO Spider and Sitemap generator (Windows/Mac) $
  • Site Map Pro (Windows) $
  • Sitemap Writer (Windows) $
  • Sitemap Generator by DevIntelligence (Windows)
  • Sorrowmans Sitemap Tools (Windows)
  • TheSiteMapper (Windows) $
  • Vigos Gsitemap (Windows)
  • Visual SEO Studio (Windows)
  • WebDesignPros Sitemap Generator (Java Webstart Application)
  • Weblight (Windows/Mac) $
  • WonderWebWare Sitemap Generator (Windows)

Online Generators/Services

  • AuditMyPc.com Sitemap Generator
  • AutoMapIt
  • Autositemap $
  • Enarion phpSitemapsNG
  • Free Sitemap Generator
  • Neuroticweb.com Sitemap Generator
  • ROR Sitemap Generator
  • ScriptSocket Sitemap Generator
  • SeoUtility Sitemap Generator (Italian)
  • SitemapDoc
  • Sitemapspal
  • SitemapSubmit
  • Smart-IT-Consulting Google Sitemaps XML Validator
  • XML Sitemap Generator
  • XML-Sitemaps Generator

CMS with integrated Sitemap generators

  • Concrete5

Google News Sitemap Generators

The following plugins allow publishers to update Google News Sitemap files, a variant of the sitemaps.org protocol that we describe in our Help Center. In addition to the normal properties of Sitemap files, Google News Sitemaps allow publishers to describe the types of content they publish, along with specifying levels of access for individual articles. More information about Google News can be found in our Help Center and Help Forums.

  • WordPress Google News plugin

Code Snippets / Libraries

  • ASP script
  • Emacs Lisp script
  • Java library
  • Perl script
  • PHP class
  • PHP generator script

If you believe that a tool should be added or removed for a legitimate reason, please leave a comment in the Webmaster Help Forum.

#4


The best one I have found is http://www.auditmypc.com/xml-sitemap.asp, which uses Java, has no limit on pages, and even lets you export results as a raw URL list.

It also uses sessions, so if you are using a CMS, make sure you are logged out before you run the crawl.

#5


So, in an ideal world you'd have a spec for all pages in your site. You would also have a test infrastructure that could hit all your pages to test them.

You're presumably not in an ideal world. Why not do this...?

  1. Create a mapping between the well-known old URLs and the new ones. Redirect when you see an old URL. I'd possibly consider presenting a "this page has moved, its new URL is XXX, you'll be redirected shortly" message.

  2. If you have no mapping, present a "sorry - this page has moved. Here's a link to the home page" message and redirect them if you like.

  3. Log all redirects - especially the ones with no mapping. Over time, add mappings for pages that are important.
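
The three steps above could be sketched roughly like this in Python. It is only an outline of the lookup and logging; the mapping entries, the logger name and the resolve function are hypothetical, and wiring it into a real 404 handler depends on the server or framework in use.

```python
import logging

log = logging.getLogger("redirects")

# Hypothetical mapping from old relative URLs to their new locations.
REDIRECTS = {
    "/old/about.asp": "/about",
    "/old/products.asp?id=3": "/products/widgets",
}


def resolve(old_path, redirects=REDIRECTS):
    """Return (status, location) for a request that missed on the new site."""
    new_path = redirects.get(old_path)
    if new_path is not None:
        # Step 1: known old URL, so answer with a permanent redirect.
        log.info("redirect %s -> %s", old_path, new_path)
        return 301, new_path
    # Steps 2 and 3: no mapping, so log it and let the caller show the
    # "sorry, this page has moved" message with a link to the home page.
    log.warning("no mapping for %s", old_path)
    return 404, None


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(resolve("/old/about.asp"))
    print(resolve("/old/unknown.asp"))
```

Reviewing the "no mapping" warnings every so often shows which old pages are still being requested and deserve an entry.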

#6


wget from a Linux box might also be a good option, as there are switches to make it spider a site and to change its output.

EDIT: wget is also available on Windows: http://gnuwin32.sourceforge.net/packages/wget.htm
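
If wget isn't an option at all, the same spidering idea can be sketched with Python's standard library. This is an illustration rather than a replacement for a real crawler: it only follows "a" links on the same host, ignores robots.txt, and the names LinkParser, fetch and crawl are made up here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page as text; non-HTML and failed requests yield ''."""
    try:
        with urlopen(url, timeout=10) as resp:
            if "html" not in resp.headers.get("Content-Type", ""):
                return ""
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""


def crawl(home, fetch=fetch, limit=1000):
    """Breadth-first crawl of same-host links; returns sorted relative paths."""
    host = urlsplit(home).netloc
    seen, queue = {home}, deque([home])
    # Stop expanding once more than `limit` URLs have been discovered.
    while queue and len(seen) <= limit:
        page = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(page))
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            url = urldefrag(urljoin(page, href)).url
            if urlsplit(url).netloc == host and url not in seen:
                seen.add(url)
                queue.append(url)
    return sorted({urlsplit(u).path or "/" for u in seen})


if __name__ == "__main__":
    # Hypothetical home page; replace with the real old site.
    for path in crawl("http://www.oldsite.com/"):
        print(path)
```

The fetch function is passed in as a parameter so the crawl logic can be tried against canned pages before pointing it at a live site.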

#7


Write a spider which reads in every HTML file from disk and outputs every "href" attribute of an "a" element (this can be done with a parser). Keep track of which links belong to which page (this is a common task for a MultiMap data structure). After this you can produce a mapping file which acts as the input for the 404 handler.
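
A rough Python sketch of that idea, assuming the old site's HTML files are already on disk under one root directory. A dict of sets stands in for the MultiMap, and the names HrefCollector and links_by_page are hypothetical:

```python
import os
from collections import defaultdict
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def links_by_page(root):
    """Map each HTML file under root to the set of hrefs it contains."""
    pages = defaultdict(set)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith((".html", ".htm")):
                continue
            full = os.path.join(dirpath, name)
            with open(full, encoding="utf-8", errors="replace") as fh:
                collector = HrefCollector()
                collector.feed(fh.read())
            # Key each page by its path relative to the site root.
            page = "/" + os.path.relpath(full, root).replace(os.sep, "/")
            pages[page].update(collector.hrefs)
    return pages


if __name__ == "__main__":
    # Assumed location of the old site's files.
    for page, hrefs in sorted(links_by_page("www.oldsite.com").items()):
        for href in sorted(hrefs):
            print(page, href, sep="\t")
```

The tab-separated output is one possible starting point for the mapping file; the new-URL column still has to be filled in by hand or by some rule.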

#8


I would look into any number of online sitemap generation tools. Personally, I've used this one (Java-based) in the past, but if you do a Google search for "sitemap builder" I'm sure you'll find lots of different options.
