How to detect whether a file is a PDF or a TIFF?

Please bear with me as I've been thrown into the middle of this project without knowing all the background. If you've got WTF questions, trust me, I have them too.

Here is the scenario: I've got a bunch of files residing on an IIS server. They have no file extension on them. Just naked files with names like "asda-2342-sd3rs-asd24-ut57" and so on. Nothing intuitive.

The problem is I need to serve up files on an ASP.NET (2.0) page and display the tiff files as tiff and the PDF files as PDF. Unfortunately I don't know which is which and I need to be able to display them appropriately in their respective formats.

For example, let's say there are 2 files I need to display: one is a TIFF and one is a PDF. The page should show the TIFF image, and perhaps a link that opens the PDF in a new tab/window.

The problem:

As these files are all extension-less I had to force IIS to just serve everything up as TIFF. But if I do this, the PDF files won't display. I could change IIS to force the MIME type to be PDF for unknown file extensions but I'd have the reverse problem.

http://support.microsoft.com/kb/326965

Is this problem easier than I think or is it as nasty as I am expecting?

8 Answers

#1

OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:

private const int kTiffTagLength = 12;
private const int kHeaderSize = 2;
private const int kMinimumTiffSize = 8;
private const byte kIntelMark = 0x49;
private const byte kMotorolaMark = 0x4d;
private const ushort kTiffMagicNumber = 42;


private bool IsTiff(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    if (stm.Length < kMinimumTiffSize)
        return false;
    byte[] header = new byte[kHeaderSize];

    stm.Read(header, 0, header.Length);

    if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
        return false;
    bool isIntel = header[0] == kIntelMark;

    ushort magicNumber = ReadShort(stm, isIntel);
    if (magicNumber != kTiffMagicNumber)
        return false;
    return true;
}

private ushort ReadShort(Stream stm, bool isIntel)
{
    byte[] b = new byte[2];
    stm.Read(b, 0, b.Length);
    return ToShort(isIntel, b[0], b[1]);
}

private static ushort ToShort(bool isIntel, byte b0, byte b1)
{
    if (isIntel)
    {
        return (ushort)(((int)b1 << 8) | (int)b0);
    }
    else
    {
        return (ushort)(((int)b0 << 8) | (int)b1);
    }
}

I hacked apart some much more general code to get this.

For PDF, I have code that looks like this:

public bool IsPdf(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    PdfToken token;
    while ((token = GetToken(stm)) != null) 
    {
        if (token.TokenType == MLPdfTokenType.Comment) 
        {
            if (token.Text.StartsWith("%PDF-1.")) 
                return true;
        }
        if (stm.Position > 1024)
            break;
    }
    return false;
}

Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of searching for a substring to avoid a problem like this:

% the following is a PostScript file, NOT a PDF file
% you'll note that in our previous version, it started with %PDF-1.3,
% incorrectly marking it as a PDF
%
clippath stroke showpage

This file is marked as NOT a PDF by the above code snippet, whereas a more simplistic check would incorrectly mark it as a PDF.

I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:

Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.

#2

TIFF can be detected by peeking at the first bytes: http://local.wasp.uwa.edu.au/~pbourke/dataformats/tiff/

The first 8 bytes form the header. The first two of these are either "II" for little-endian byte ordering or "MM" for big-endian byte ordering.

About PDF: http://www.adobe.com/devnet/livecycle/articles/lc_pdf_overview_format.pdf

The header contains just one line that identifies the version of PDF. Example: %PDF-1.6

#3

Reading the specification for each file format will tell you how to identify files of that format.

TIFF files - Check bytes 0-1 for 0x4D4D or 0x4949 and bytes 2-3 for the value 42.

Page 13 of the spec reads:

A TIFF file begins with an 8-byte image file header, containing the following information:

Bytes 0-1: The byte order used within the file. Legal values are: “II” (4949.H) and “MM” (4D4D.H). In the “II” format, byte order is always from the least significant byte to the most significant byte, for both 16-bit and 32-bit integers. This is called little-endian byte order. In the “MM” format, byte order is always from most significant to least significant, for both 16-bit and 32-bit integers. This is called big-endian byte order.

Bytes 2-3: An arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file. The byte order depends on the value of Bytes 0-1.

PDF files start with the PDF version followed by several binary bytes. (I think you now have to purchase the ISO spec for the current version.)

Section 7.5.2

The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7. A conforming reader shall accept files with any of the following headers: %PDF-1.0, %PDF-1.1, %PDF-1.2, %PDF-1.3, %PDF-1.4, %PDF-1.5, %PDF-1.6, %PDF-1.7. Beginning with PDF 1.4, the Version entry in the document’s catalog dictionary (located via the Root entry in the file’s trailer, as described in 7.5.5, "File Trailer"), if present, shall be used instead of the version specified in the Header.

If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file’s contents as text or as binary.

Of course you could do a "deeper" check on each file by checking more file specific items.
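
As one example of such a "deeper" check for TIFF, here is a minimal sketch. It relies on the fact that bytes 4-7 of the TIFF header hold the offset of the first image file directory (IFD), which has to lie after the 8-byte header and inside the file. The class and method names (TiffHeaderCheck, LooksLikeTiff) are made up for illustration, and this only validates the header; it does not parse the IFD itself.

```csharp
public static class TiffHeaderCheck
{
    // head: the first 8 bytes of the file; fileLength: total size of the file in bytes.
    // Returns true when the 8-byte header is plausible for a TIFF of that size.
    public static bool LooksLikeTiff(byte[] head, long fileLength)
    {
        if (head == null || head.Length < 8 || fileLength < 8)
            return false;

        bool little = head[0] == 0x49 && head[1] == 0x49; // "II"
        bool big = head[0] == 0x4D && head[1] == 0x4D;    // "MM"
        if (!little && !big)
            return false;

        // Bytes 2-3: the number 42, stored in the file's own byte order.
        int magic = little ? head[2] | (head[3] << 8) : (head[2] << 8) | head[3];
        if (magic != 42)
            return false;

        // Bytes 4-7: offset of the first IFD, also in the file's byte order.
        long ifdOffset = little
            ? head[4] | ((long)head[5] << 8) | ((long)head[6] << 16) | ((long)head[7] << 24)
            : ((long)head[4] << 24) | ((long)head[5] << 16) | ((long)head[6] << 8) | head[7];

        // The IFD must start after the header and before the end of the file.
        return ifdOffset >= 8 && ifdOffset < fileLength;
    }
}
```

A file that merely happens to start with "II*" followed by garbage will usually fail the offset test, which makes this a little stricter than comparing the first four bytes.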

#4

A very useful list of file signatures (aka "magic numbers") by Gary Kessler is available at http://www.garykessler.net/library/file_sigs.html

#5

Internally, the file header information should help. If you do a low-level file open, such as StreamReader() or FOPEN(), look at the first two characters in the file. Almost every file type has its own signature.

PDF starts with "%P" (more specifically, "%PDF")
TIFF starts with "II" (or "MM" for big-endian files)
Bitmap files start with "BM"
Executable files start with "MZ"

I've had to deal with this in the past too, partly to keep unwanted files from being uploaded to a given site by aborting the upload as soon as the check fails.

EDIT -- Posted sample code to read and test file header types

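String fn = "Example.pdf";

// Read raw bytes: a StreamReader decodes text and can mangle binary signatures.
byte[] buf = new byte[5];
int read;
using (FileStream fs = File.OpenRead(fn))
{
    read = fs.Read(buf, 0, buf.Length);
}

// Map each byte to the char with the same code so signatures compare as strings.
char[] chars = new char[read];
for (int i = 0; i < read; i++)
    chars[i] = (char)buf[i];
String Hdr = new String(chars);

// Use ordinal comparison: culture-sensitive StartsWith may ignore '\0' characters.
// C# has no octal escapes, so the bytes 0x01 and 0x02 are written as \u0001 and \u0002.
String WhatType;
if (Hdr.StartsWith("%PDF", StringComparison.Ordinal))
   WhatType = "PDF";
else if (Hdr.StartsWith("MZ", StringComparison.Ordinal))
   WhatType = "EXE or DLL";
else if (Hdr.StartsWith("BM", StringComparison.Ordinal))
   WhatType = "BMP";
else if (Hdr.StartsWith("?_", StringComparison.Ordinal))
   WhatType = "HLP (help file)";
else if (Hdr.StartsWith("\0\0\u0001", StringComparison.Ordinal))
   WhatType = "Icon (.ico)";
else if (Hdr.StartsWith("\0\0\u0002", StringComparison.Ordinal))
   WhatType = "Cursor (.cur)";
else
   WhatType = "Unknown";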

#6

If you go here, you will see that a TIFF usually starts with the "magic numbers" 0x49 0x49 0x2A 0x00 (some other definitions are also given), which are the first 4 bytes of the file.

So just use these first 4 bytes to determine whether the file is a TIFF or not.

EDIT: it is probably better to do it the other way around and detect PDF first. The magic numbers for PDF are more standardized: as Plinth kindly pointed out, "%PDF" (0x25 0x50 0x44 0x46) appears somewhere in the first 1024 bytes.
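
A minimal sketch of that scan, assuming the file is available as a Stream positioned at its start. It looks for the five bytes of "%PDF-" (the header form quoted from the spec in another answer) within the first 1024 bytes. The names PdfHeaderCheck and HasPdfHeader are invented for this example. Keep in mind the caveat from the first answer: a plain byte search like this will also accept a non-PDF file that happens to mention %PDF- in a comment near the top.

```csharp
using System.IO;

public static class PdfHeaderCheck
{
    // The bytes of "%PDF-".
    private static readonly byte[] Marker = { 0x25, 0x50, 0x44, 0x46, 0x2D };

    // Returns true if "%PDF-" occurs entirely within the first 1024 bytes of the stream.
    public static bool HasPdfHeader(Stream stm)
    {
        byte[] buf = new byte[1024];
        int filled = 0;
        int n;
        // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
        while (filled < buf.Length && (n = stm.Read(buf, filled, buf.Length - filled)) > 0)
            filled += n;

        // Naive search: try every start position where the whole marker still fits.
        for (int i = 0; i + Marker.Length <= filled; i++)
        {
            int j = 0;
            while (j < Marker.Length && buf[i + j] == Marker[j])
                j++;
            if (j == Marker.Length)
                return true;
        }
        return false;
    }
}
```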

#7

You are going to have to write an .ashx handler to serve the requested file.

Then your handler should read the first few bytes to determine what the file type really is. PDF and TIFF files have "magic numbers" at the beginning of the file that you can use to determine this. Then set your response headers accordingly.
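
The part of the handler that picks the header value could look roughly like this. It is only a sketch: ContentTypeSniffer and Sniff are hypothetical names, the caller is assumed to have already read the first bytes of the file into head, and the fallback to application/octet-stream for unrecognized files is my own choice rather than anything the question requires.

```csharp
using System;
using System.Text;

public static class ContentTypeSniffer
{
    // head: the first bytes of the file (up to 1024 bytes is enough for these checks).
    // Returns the value to use for the Content-Type response header.
    public static string Sniff(byte[] head)
    {
        if (head.Length >= 4)
        {
            bool tiffLittle = head[0] == 0x49 && head[1] == 0x49 && head[2] == 0x2A && head[3] == 0x00;
            bool tiffBig = head[0] == 0x4D && head[1] == 0x4D && head[2] == 0x00 && head[3] == 0x2A;
            if (tiffLittle || tiffBig)
                return "image/tiff";
        }

        // ASCII decoding turns bytes >= 128 into '?', which cannot affect a search for "%PDF-".
        string text = Encoding.ASCII.GetString(head, 0, Math.Min(head.Length, 1024));
        if (text.IndexOf("%PDF-", StringComparison.Ordinal) >= 0)
            return "application/pdf";

        return "application/octet-stream";
    }
}
```

Inside ProcessRequest you would then assign the result to context.Response.ContentType before writing the file to the response, for example with Response.TransmitFile.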

#8

You can use Myrmec to identify the file type; the library inspects the leading bytes of the file. It is available on NuGet as "Myrmec" and also supports MIME types. The code looks like this:

// create a sniffer instance.
Sniffer sniffer = new Sniffer();

// populate with metadata.
sniffer.Populate(FileTypes.CommonFileTypes);

// get the leading bytes of the file; 20 bytes may be enough.
byte[] fileHead = ReadFileHead();

// start match.
List<string> results = sniffer.Match(fileHead);

And to get the MIME type:

List<string> result = sniffer.Match(head);

string mimeType = MimeTypes.GetMimeType(result.First());

Note that for TIFF it only supports the two signatures "49 49 2A 00" and "4D 4D 00 2A"; if you need more, you can add your own. See the Myrmec README on GitHub for help.
