What's the best "file format" for saving complete web pages (images, etc.) in a single archive?

Time: 2022-10-12 22:28:04

I'm working on a project which stores single images and text files in one place, like a time capsule. Now, most every project can be saved as one file, like DOC, PPT, and ODF. But complete web pages can't -- they're saved as a separate HTML file and data folder. I want to save a web page in a single archive, and while there are several solutions, there's no "standard". Which is the best format for HTML archives?

  • Microsoft has MHTML -- basically a file encoded exactly like a MIME HTML email message. It builds on an existing standard, and MHTML itself was proposed as RFC 2557. This is a great idea and it's been around forever, except that it has remained a "proposed standard" since 1999. Plus, implementations other than IE's are cumbersome: IE and Opera support it natively, while Firefox and Safari need a cumbersome extension.

  • Mozilla has Mozilla Archive Format -- basically a ZIP file with the markup and images, with metadata saved as RDF. It's an awesome idea -- Winamp does this for skins, and ODF and OOXML do it for their embedded images. I love this, except that 1. nobody except Mozilla uses it, and 2. the only extension supporting it hasn't been updated since Firefox 1.5.

  • Data URIs are becoming more popular. Instead of referencing an external location a la MHTML or MAF, you encode the file straight into the HTML markup as base64. Depending on your view, it's streamlined since the files are right where the markup is. However, support is still somewhat weak. Firefox, Opera, and Safari support it without gaffes; IE, the market leader, only started supporting it at IE8, and even then with limits.

  • Then of course, there's "Save complete webpage", where the HTML markup is saved as "savedpage.html" and the files go in a separate "savedpage_files" folder. AFAIK, everyone does this. It's well supported. But having to handle two separate elements is not simple or streamlined at all. My project needs to have them in a single archive.

Keeping in mind browser support and ease of editing the page, what do you think's the best way to save web pages in a single archive? What would be best as a "standard"? Or should I just buckle down and deal with the HTML file and separate folder? For the sake of my project, I could support that, but I'd best avoid it.

7 solutions

#1


15  

My favourite is the ZIP format. Because:

  • It is very well suited for the purpose
  • It is well documented
  • There are a lot of implementations available for creating or reading them
  • A user can easily extract single files, change them and put them back in the archive
  • Almost every major operating system (Windows, Mac and most Linux distributions) has a ZIP program built in

The alternatives all have some flaw:

  • With MHTML, you cannot easily edit.
  • With data URIs, I don't know how difficult the implementation would be. (With ZIP, even I could do it in PHP, 3 years ago...)
  • The option to store things as separate files just has far too many things that could go wrong and mess up your archive.
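
To give a feel for how little code the ZIP route needs, here is a minimal sketch in Python (I did mine in PHP; the language is not the point). The function name `pack_page` and the folder layout are made up for illustration, and it assumes the archive is written somewhere outside the folder being packed.

```python
import zipfile
from pathlib import Path


def pack_page(page_dir: str, archive_path: str) -> list:
    """Bundle every file under page_dir into a single ZIP archive.

    Member names are stored relative to page_dir, so index.html ends up
    at the archive root. Assumes archive_path lies outside page_dir.
    """
    root = Path(page_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Use forward slashes so the archive is portable across systems
                zf.write(path, path.relative_to(root).as_posix())
        return zf.namelist()
```

Calling `pack_page("savedpage", "savedpage.zip")` on a saved page folder would return the list of stored member names, which is handy for a quick sanity check.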

#2


4  

PDFs are supported on nearly all browsers on nearly all platforms and store content and images in a single file. They can be edited with the right tools. This is almost definitely not ideal, but it's an option to consider.

#3


4  

It is not only a question of file format. Another crucial question is: what exactly do you want to store? Is it:

  1. to store the whole page as it is, with all referenced resources (images, CSS and JavaScript)?

  2. to capture the page as it was rendered at some point in time, i.e. a static image of some rendered state of the page's DOM?

Most current "save page as" functionality in browsers, be it to MAF, MHTML or file+dir, attempts the first way. This is ultimately a flawed approach.

Don't forget that web pages these days are more like local applications than static documents you can easily store. Potential issues:

  1. One page is in fact several pages built dynamically by JS; user interaction is needed to get it into the desired state.

  2. AJAX applications can communicate with a remote service, which renders the saved page unusable for offline viewing.

  3. Hidden links in JavaScript code. Such a resource is then not part of the stored page. Even parsing the JS code may not discover them. You need to run the code.

  4. Even the position of basic HTML elements may be computed dynamically by JS, and it is not always possible or easy to recreate it locally.

  5. You would need some sort of JS memory dump, and a way to load it, to get the page into the state you hoped to store.

And many many more issues...

Check out the SingleFile extension for Chrome. It stores a web page as one HTML file, with images inlined using the data URIs already mentioned. I haven't tested it much, so I cannot say how well it handles "volatile" AJAX pages.
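
As a rough illustration of the data URI idea (this is not how SingleFile itself is implemented), the sketch below inlines locally saved images into the markup. The helper names `to_data_uri` and `inline_images` are hypothetical, and the regular expression is deliberately naive: it only handles double-quoted `src` attributes on `img` tags and says nothing about CSS, scripts or anything loaded dynamically.

```python
import base64
import mimetypes
import re
from pathlib import Path


def to_data_uri(path: Path) -> str:
    """Encode a local file as a base64 data URI."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inline_images(html: str, base_dir: str) -> str:
    """Replace relative <img src="..."> references with data URIs.

    Remote, absolute and already-inlined sources are left untouched,
    as are references to files that do not exist under base_dir.
    """
    def replace(match):
        src = match.group(2)
        if "://" in src or src.startswith(("/", "data:")):
            return match.group(0)
        target = Path(base_dir) / src
        if not target.is_file():
            return match.group(0)
        return f"{match.group(1)}{to_data_uri(target)}{match.group(3)}"

    # Naive pattern: double-quoted src attributes on <img> tags only
    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', replace, html)
```

A real tool would use an HTML parser instead of a regular expression, and would still face every issue in the list above.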

#4


3  

Use a zip file.

You could always write a program or script that extracts the ZIP file to a temp directory and loads the index.html file in your browser. You could even use an index.ini or index.txt file to specify which file should be loaded after extraction.
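
Something along these lines would do, sketched here in Python with only the standard library. The function names and the `webarchive_` temp-directory prefix are made up, and the entry page is simply assumed to be `index.html` rather than read from an index.ini file.

```python
import tempfile
import webbrowser
import zipfile
from pathlib import Path


def extract_archive(archive_path: str, entry: str = "index.html") -> Path:
    """Extract the archive into a fresh temp directory.

    Returns the path of the entry page, or raises FileNotFoundError
    if the archive does not contain it.
    """
    target = Path(tempfile.mkdtemp(prefix="webarchive_"))
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target)
    page = target / entry
    if not page.is_file():
        raise FileNotFoundError(f"{entry} not found in {archive_path}")
    return page


def open_archive(archive_path: str) -> None:
    """Extract the archive and load its entry page in the default browser."""
    page = extract_archive(archive_path)
    webbrowser.open(page.as_uri())
```

Note that the temp directory is not cleaned up here; a real viewer would want to remove it once the page is closed.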

Basically, you want something like the Mozilla Archive format, but without the unnecessary rdf crap just to specify what file to load.

MHT files are good, but they usually use base64 to embed files, which makes the file bigger than it should be (data URIs have the same problem). You can add attachments as binary, but you'll have to do that manually with a hex editor or write a tool for it, and client support for it might not be as good.
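
To put a number on that overhead: base64 turns every 3 bytes into 4 characters, so the embedded data grows by about a third, before any line breaks the MIME encoding adds. A small sketch (the function name `base64_size` is just for illustration):

```python
import base64


def base64_size(n_bytes: int) -> int:
    """Length of the standard padded base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


if __name__ == "__main__":
    raw = bytes(30_000)  # stand-in for a 30,000-byte image
    encoded = base64.b64encode(raw)
    # Both numbers printed on this line are the encoded length
    print(len(encoded), base64_size(len(raw)))
```

So a 30,000-byte image takes 40,000 characters once embedded, whereas a ZIP archive stores it at its original size or smaller.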

Of course, if you want to use what browsers generate, MHT (Opera and IE at least) might be better.

#5


1  

I see no excuse to use anything other than a ZIP file.

#6


0  

Well, if browser support and ease of editing are the biggest concerns I think you are stuck with the file+directory approach unless you are willing to provide an editor for the single file format and live with not very good support in browsers.

You can create a single file by compressing the contents. You can also create a parent directory to ease handling.

#7


-1  

The problem is that HTML is bottom-up, not top-down. Look at the file name of this page, which was saved on my box as "What's the best "file format" for saving complete web pages (images, etc.) in a single archive? - Stack Overflow.html"

Just add a '|' and one has trouble doing copy-and-paste backups to a spare drive. In the end you end up chopping the file name in order to save it. Dozens, perhaps hundreds, of identical index.html or index.php files are cluttering my drives.

A partial solution is to write your own CMS and use scripts to map all relevant files into a flat-file database, then use the file name, size, mtime and MD5 to derive a unique ID for each file. Create a flat-file index permitting 100k or 1000k records. The goal is to write once and use many times. So you need a real CMS, and you need a unique ID based on content (e.g. index8765432.html) that goes in your files_archive. Ditto for the others. Then you can non-destructively symlink from the saved original HTML to the files_archive and just recreate the file using a PHP or alternative script if need be. I don't know if it will work, as I'm at the same point you're at; maybe in a week I'll know for sure.

The more useful approach is to have a top-down structure based on your business or personal wants and related tasks. So your files might be organized top-down, but external ones bottom-up to preserve the original content. My interest is in Web 3.0 services, and the closer you get to machine-to-machine interaction, the greater the need to structure the information. Maybe it's time to rethink the idea of bundling everything into a single file. If you have hundreds of main.css files, why bundle them when a top-down solution might let you modify one file instead of hundreds?
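
For the content-based ID, a minimal sketch might look like the following. It is written in Python rather than PHP, it uses only the file name and the MD5 of the contents (leaving out size and mtime for brevity), and the function name `content_id` and the naming scheme are assumptions, not something I have running.

```python
import hashlib
from pathlib import Path


def content_id(path: str, chunk_size: int = 1 << 16) -> str:
    """Build an archive name from the original file name and an MD5 of the contents.

    Files with the same name and identical contents map to the same ID,
    while files with the same name but different contents do not.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large files need not fit in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    p = Path(path)
    return f"{p.stem}-{digest.hexdigest()}{p.suffix}"
```

With that, the many index.html files on a drive would each get a distinct name in files_archive unless they really are byte-for-byte copies.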
