Finding and deleting orphaned web pages, images, and other related files

Date: 2021-08-03 07:31:43

I am working on a number of websites with files dating back to 2000. These sites have grown organically over time, resulting in large numbers of orphaned web pages, include files, images, CSS files, JavaScript files, etc. These orphaned files cause a number of problems, including poor maintainability, possible security holes, poor customer experience, and driving OCD/GTD freaks like myself crazy.

These files number in the thousands, so a completely manual solution is not feasible. Ultimately, the cleanup process will require a fairly large QA effort to ensure we have not inadvertently deleted needed files, but I am hoping to develop a technological solution to help speed up the manual effort. Additionally, I hope to put processes/utilities in place to help prevent this state of disorganization from happening in the future.

Environment Considerations:

  • Classic ASP and .NET
  • Windows servers running IIS 6 and IIS 7
  • Multiple environments (Dev, Integration, QA, Stage, Production)
  • TFS for source control

Before I start I would like to get some feedback from others who have successfully navigated a similar process.

Specifically, I am looking for:

  • A process for identifying and cleaning up orphaned files
  • A process for keeping environments free of orphaned files
  • Utilities that help identify orphaned files
  • Utilities that help identify broken links (once files have been removed)

I am not looking for:

  • Solutions to my organizational OCD... I like how I am.
  • Snide comments about us still using classic ASP. I already feel the pain. There is no need to rub it in.

4 Answers

#1


1  

Step 1: Establish a list of pages on your site which are definitely visible. One intelligent way to create this list is to parse your log files for pages people visit.

Step 2: Run a tool that recursively crawls the site topology, starting from a specially written page (one you create on your site) that links to every page from step 1. One tool that can do this is Xenu's Link Sleuth. It's intended for finding dead links, but it lists live links as well. It can be run externally, so there are no security concerns about installing 'weird' software on your server. You'll need to keep an eye on the crawl, since a bug can make your site appear to have infinitely many pages.

Step 3: Run a tool that recursively maps your hard disk, starting from your site web directory. I can't think of any of these off the top of my head, but writing one should be trivial, and is safer since this will be run on your server.

Step 4: Take the results of steps 2 and 3 and programmatically match #2 against #3. Anything in #3 that is not in #2 is potentially an orphan page.

Note: This technique works poorly with password-protected content, and also with sites that rely heavily on dynamically generated links (dynamic content is fine if the links are consistent).
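
Steps 3 and 4 could be sketched roughly as follows in Python. This is only an illustration, not something from the answer itself: the function names are made up, and it assumes the crawl results from step 2 have already been exported as a list of site-relative paths. Paths are lower-cased on the assumption that the site is served from a case-insensitive Windows file system.

```python
import os

def list_site_files(web_root):
    """Step 3: walk the web root and return site-relative paths such as '/img/logo.gif'."""
    found = set()
    for folder, _subdirs, files in os.walk(web_root):
        for name in files:
            rel = os.path.relpath(os.path.join(folder, name), web_root)
            # Normalise to lower-case, forward-slash paths (assumes a case-insensitive server).
            found.add("/" + rel.replace(os.sep, "/").lower())
    return found

def potential_orphans(files_on_disk, crawled_paths):
    """Step 4: anything on disk that the crawl never reached is a candidate orphan."""
    crawled = {path.lower() for path in crawled_paths}
    return sorted(files_on_disk - crawled)
```

The result is only a list of candidates; as the note says, password-protected areas and dynamically generated links will produce false positives.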

#2


2  

At first I thought you could get away with scanning files for links and then doing a diff against your folder structure - but this only identifies simple orphans, not collections of orphaned files that reference each other. So using grep probably won't get you all the way there.

This isn't a trivial solution, but would make an excellent utility for keeping your environment clean (and therefore, worth the effort). Plus, you can re-use it across all environments (and share it with others!)

The basic idea is to set up and populate a directed graph where each node's key is an absolute path. This is done by scanning all the files and adding dependencies - for example:

/index.html     -> /subfolder/file.jpg
                -> /temp.html
                -> /error.html
/temp.html      -> /index.html
/error.html
/stray.html     -> /index.html
/abandoned.html
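
The scanning step might look something like this minimal Python sketch. It is an assumption-laden illustration rather than the answerer's code: the regular expression, the `outgoing_links` name, and the decision to treat classic ASP `#include file=`/`virtual=` directives like ordinary links are all mine, and a regex will miss links that are built up in script.

```python
import posixpath
import re

# Matches href="...", src='...' and ASP includes such as <!--#include file="..."-->.
# Query strings and #fragments are cut off so that only the file path is captured.
LINK_RE = re.compile(r'\b(?:href|src|file|virtual)\s*=\s*["\']([^"\'#?]+)', re.IGNORECASE)
SCHEME_RE = re.compile(r'[a-z][a-z0-9+.-]*:', re.IGNORECASE)

def outgoing_links(page_path, page_source):
    """Return the site-absolute paths that the page at page_path refers to."""
    base = posixpath.dirname(page_path)
    links = set()
    for target in LINK_RE.findall(page_source):
        if SCHEME_RE.match(target) or target.startswith("//"):
            continue  # skip external URLs, mailto:, javascript:, and so on
        # Resolve relative references ("../x.html") against the page's own folder.
        links.add(posixpath.normpath(posixpath.join(base, target)))
    return links
```

Calling this for every file under the web root and storing the result under the file's own path gives the adjacency list used in the rest of this answer.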

Then you can identify all your "reachable" files by running a breadth-first search (BFS) starting from your root page.
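
As a rough sketch of that search in Python: the `graph` dictionary below is the example graph written as an adjacency list (treating the temp.html entries as a single file at /temp.html, which is my reading of the example), and `reachable_from` is a hypothetical helper name.

```python
from collections import deque

def reachable_from(graph, root):
    """Breadth-first search: return every path reachable from root by following links."""
    seen = {root}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

graph = {
    "/index.html": ["/subfolder/file.jpg", "/temp.html", "/error.html"],
    "/temp.html": ["/index.html"],
    "/error.html": [],
    "/stray.html": ["/index.html"],
    "/abandoned.html": [],
}

unreachable = sorted(set(graph) - reachable_from(graph, "/index.html"))
print(unreachable)  # ['/abandoned.html', '/stray.html']
```

Because the search only follows links outward from the root, a group of orphaned files that link to each other is still reported as unreachable, which is exactly what a simple grep-and-diff misses.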

With the directed graph, you can also classify files by their in-degree and out-degree. In the example above:

/index.html     in: 2 out: 3
/temp.html      in: 1 out: 1
/error.html     in: 1 out: 0
/stray.html     in: 0 out: 1
/abandoned.html in: 0 out: 0

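So you're basically looking for files with in = 0; nothing links to them, so they are abandoned. (Orphans that only link to each other will have in > 0, which is why the BFS from the root page is still needed.)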

Additionally, files with out = 0 are terminal pages, which may or may not be desirable on your site (/error.html, as its name suggests, is an error page).
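
The degree classification can be computed from the same adjacency list. Again, this is just a sketch under my own assumptions: `degrees` is a made-up name, duplicate links from one page are counted once, and the graph treats the example's temp.html entries as one file at /temp.html.

```python
from collections import Counter

def degrees(graph):
    """Return {path: (in_degree, out_degree)} for every node, including link-only targets."""
    in_deg = Counter()
    for targets in graph.values():
        for target in set(targets):  # count each distinct link from a page once
            in_deg[target] += 1
    nodes = set(graph) | set(in_deg)
    return {node: (in_deg[node], len(set(graph.get(node, ())))) for node in nodes}

graph = {
    "/index.html": ["/subfolder/file.jpg", "/temp.html", "/error.html"],
    "/temp.html": ["/index.html"],
    "/error.html": [],
    "/stray.html": ["/index.html"],
    "/abandoned.html": [],
}

result = degrees(graph)
no_inbound = sorted(path for path, (inbound, _out) in result.items() if inbound == 0)
terminal = sorted(path for path, (_in, outbound) in result.items() if outbound == 0)
```

Here `no_inbound` holds the files nothing links to, and `terminal` holds the dead ends. The root page of a real site may itself show in = 0, so it is worth excluding it explicitly before deleting anything.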

#3


1  

No snide comments here... I feel your pain as a large portion of our site is still in classic ASP.

I don't know of any fully automated system that will be a magic bullet, but I do have a couple of ideas that could help. At least it's how we cleaned up our site.

First, although it hardly seems like the tool for such a job, I've used Microsoft Visio to help with this. We have Visio for Enterprise Architects, and I am not sure whether this feature is in other versions, but in this version you can create a new document, and in "choose drawing type", under the "Web Diagram" folder, there is an option for a "Web Site Map" (either Metric or US units - it doesn't matter).

When you create this drawing type, Visio prompts you for the URL of your web site, and then goes out and crawls your web site for you.

This should help to identify which files are valid. It's not perfect, but the way we used it was to find the files in the file system that did not show up in the Visio drawing, then pull up the entire solution in Visual Studio and search for that file name. If we could not find it anywhere in the solution, we moved it into an "Obsolete" folder for a month, and deleted it only if we didn't start getting complaints or 404 errors on the web site.
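
The "move it aside first" step could be scripted along these lines. This is a hypothetical Python sketch, not what we actually ran: `quarantine` and its parameters are invented names, and it moves files directly on disk, whereas in a TFS workspace you would probably want to do the move through source control instead.

```python
import os
import shutil

def quarantine(web_root, obsolete_root, site_path):
    """Move one suspected orphan out of the web root, keeping its relative path."""
    rel = site_path.lstrip("/").replace("/", os.sep)
    source = os.path.join(web_root, rel)
    target = os.path.join(obsolete_root, rel)
    # Recreate the folder structure so the file can be restored to the same place.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.move(source, target)
    return target
```

Keeping the original relative path under the "Obsolete" folder makes it easy to put a file back if complaints or 404 errors do turn up during the waiting period.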

Another possible solution would be to use a log file parser, parse your logs for the last n months, and look for missing files that way, but that would essentially be a lot of coding to come up with a list of "known good" files that's really no better than the Visio option.
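
For what it's worth, the log-parsing route might be sketched like this in Python. It assumes the IIS logs are in the W3C extended format, where a `#Fields:` directive names the space-separated columns, and that the `cs-uri-stem` and `sc-status` fields are being logged; the function name and the choice to count only 2xx and 304 responses as "known good" are my assumptions.

```python
def requested_paths(log_lines):
    """Collect the paths that were successfully served, from W3C extended log lines."""
    fields = []
    seen = set()
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # column names for the entries that follow
            continue
        if not line or line.startswith("#") or not fields:
            continue  # skip other directives and anything before the first #Fields line
        row = dict(zip(fields, line.split()))
        status = row.get("sc-status", "")
        if "cs-uri-stem" in row and (status.startswith("2") or status == "304"):
            seen.add(row["cs-uri-stem"].lower())
    return seen
```

Feeding it every log file for the chosen period yields a set of paths that were actually requested. Files that never appear in it are candidates for the "Obsolete" folder, not proof that they are unused.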

#4


0  

Been there, done that many times. Why can't the content types clean up after themselves? Personally, I'd hit it something like this:

1) Get a copy of the site running in a QA environment.

2) Use Selenium (or some other browser-based testing tool) to create a suite of tests for stuff that works.

3) Start deleting stuff that should be deleted.

4) Run the tests from #2 after deleting stuff to ensure it still works.

5) Repeat steps 3 and 4 until satisfied.
