Java / Apache Tika: How to get the last modified / creation attributes of a file from a URL

Time: 2023-01-26 16:21:46

I want to use Java to get the last modified time and the creation time of a file on an HTTP server. The file is located at a specific URL. The methods using URLConnection and HttpURLConnection yield the Last-Modified attribute from the HTTP header, but this is not the actual creation date of the file.

I have been reading that Apache Tika is the library for the job. However, I have not been able to find a working example that does what I want. The closest example is perhaps here. But when I run the code given in that post, it does not yield the last modified attribute.

I am partly following the approach given in this answer, which I think might work, but it currently does not print anything.

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();


URI u = new URI("https://sec.gov/Archives/edgar/full-index/2015/QTR4/master.idx");
InputStream is = new BufferedInputStream(new FileInputStream(new File(u)));

parser.parse(is, handler, metadata, new ParseContext());
System.out.println("Creation Date" + metadata.get(Metadata.CREATION_DATE));
System.out.println("Last Modified Date" + metadata.get(Metadata.LAST_MODIFIED));

1 solution

#1


When you are downloading the file using URLConnection, the HTTP headers are hidden from Tika.

All Tika can read here is the same as if you had saved the file and opened a stream on it.

This means the creation date and last-modified date will be the ones set when the file was saved (the same values your OS file browser shows, e.g. Windows Explorer or Nautilus).

If you only need to read the HTTP headers for that file, do not use Tika. Use HttpURLConnection directly, any other HTTP client such as Apache HttpComponents (https://hc.apache.org/httpcomponents-client-4.5.x/), or the methods proposed in this other question.

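As a rough illustration of the HttpURLConnection route, here is a minimal sketch that sends a HEAD request and reads the Last-Modified header. The class name RemoteLastModified, its two helper methods, and the User-Agent value are made up for this example and are not part of the original post. It also assumes the server sends the header in the standard RFC 1123 date format. Note that HTTP defines no standard header for a file's creation date, so this can only recover the last-modified time.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Optional;

// Hypothetical helper class; the name is not from the original post.
class RemoteLastModified {

    // Parses an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
    // Throws DateTimeParseException if the server uses a non-standard format.
    static Instant parseHttpDate(String value) {
        return ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
    }

    // Sends a HEAD request so only the headers are transferred, not the file body.
    static Optional<Instant> lastModified(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setRequestMethod("HEAD");
        // Assumption: some servers reject requests that carry no User-Agent.
        conn.setRequestProperty("User-Agent", "example-client/1.0");
        try {
            String header = conn.getHeaderField("Last-Modified"); // null if absent
            return header == null ? Optional.empty() : Optional.of(parseHttpDate(header));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // URL taken from the question; the server may or may not answer HEAD requests.
        URI uri = URI.create("https://sec.gov/Archives/edgar/full-index/2015/QTR4/master.idx");
        System.out.println("Last-Modified: "
                + lastModified(uri).map(Instant::toString).orElse("(not sent by server)"));
    }
}
```

Returning an Optional makes the "header not sent" case explicit instead of printing null. Whether the value reflects the file's real modification time depends entirely on how the server is configured.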
