How can I safely encode a string in Java for use as a filename?

Time: 2021-12-19 06:51:35

I'm receiving a string from an external process. I want to use that String to make a filename, and then write to that file. Here's my code snippet to do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), s);
    PrintWriter currentWriter = new PrintWriter(currentFile);

If s contains an invalid character, such as '/' in a Unix-based OS, then a java.io.FileNotFoundException is (rightly) thrown.

How can I safely encode the String so that it can be used as a filename?

Edit: What I'm hoping for is an API call that does this for me.

I can do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), URLEncoder.encode(s, "UTF-8"));
    PrintWriter currentWriter = new PrintWriter(currentFile);

But I'm not sure whether URLEncoder is reliable for this purpose.

9 answers

#1


11  

If you want the result to resemble the original string, SHA-1 or any other hashing scheme is not the answer. If collisions must be avoided, then simple replacement or removal of "bad" characters is not the answer either.

Instead you want something like this.

char fileSep = '/'; // ... or do this portably.
char escape = '%'; // ... or some other legal char.
String s = ...
int len = s.length();
StringBuilder sb = new StringBuilder(len);
for (int i = 0; i < len; i++) {
    char ch = s.charAt(i);
    if (ch < ' ' || ch >= 0x7F || ch == fileSep || ... // add other illegal chars
        || (ch == '.' && i == 0) // we don't want to collide with "." or ".."!
        || ch == escape) {
        sb.append(escape);
        if (ch < 0x10) {
            sb.append('0');
        }
        sb.append(Integer.toHexString(ch));
    } else {
        sb.append(ch);
    }
}
File currentFile = new File(System.getProperty("user.home"), sb.toString());
PrintWriter currentWriter = new PrintWriter(currentFile);

This solution gives a reversible encoding (with no collisions) where the encoded strings resemble the original strings in most cases. I'm assuming that you are using 8-bit characters.
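
A minimal runnable sketch of the same idea, with a matching decoder. The class name FilenameEscaper and the exact set of escaped characters are assumptions made for illustration: the backslash is escaped in addition to '/', and characters above 0xFF are rejected because two hex digits cannot represent them, which is the 8-bit assumption stated above.

```java
final class FilenameEscaper {
    private static final char ESCAPE = '%';

    // Escaped in this sketch: control characters, DEL and above, the escape
    // character itself, both common separators, and a leading dot.
    private static boolean mustEscape(char ch, int index) {
        return ch < ' ' || ch >= 0x7F || ch == ESCAPE
                || ch == '/' || ch == '\\'
                || (ch == '.' && index == 0);
    }

    static String encode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch > 0xFF) {
                // Two hex digits cannot represent this character.
                throw new IllegalArgumentException("not an 8-bit character at index " + i);
            }
            if (mustEscape(ch, i)) {
                // Always exactly two hex digits, so decoding is unambiguous.
                sb.append(ESCAPE).append(String.format("%02x", (int) ch));
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    static String decode(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char ch = name.charAt(i);
            if (ch == ESCAPE) {
                // Read the two hex digits that follow the escape character.
                sb.append((char) Integer.parseInt(name.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

With this sketch, encode("a/b") gives "a%2fb" and decode reverses it. decode assumes its input was produced by encode and throws on a malformed escape.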

URLEncoder works, but it has the disadvantage that it encodes a whole lot of legal file name characters.

If you want a not-guaranteed-to-be-reversible solution, then simply remove the 'bad' characters rather than replacing them with escape sequences.

#2


87  

My suggestion is to take a "white list" approach, meaning don't try and filter out bad characters. Instead define what is OK. You can either reject the filename or filter it. If you want to filter it:

String name = s.replaceAll("\\W+", "");

This replaces every character that isn't a digit, letter or underscore with nothing. Alternatively, you could replace them with another character (like an underscore).

The problem is that if this is a shared directory then you don't want file name collisions. Even if storage areas are segregated by user, you may end up with a colliding filename just by filtering out bad characters. The name a user put in is also often useful if they ever want to download the file.
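
The collision risk is easy to reproduce. The sketch below wraps the replaceAll call in a hypothetical helper named filter and feeds it two different inputs. Note that without the UNICODE_CHARACTER_CLASS flag, Java's \W treats only ASCII letters and digits as word characters, so accented letters are removed as well.

```java
final class WhitelistFilterDemo {
    // Removes every character that is not an ASCII letter, digit or underscore.
    static String filter(String s) {
        return s.replaceAll("\\W+", "");
    }

    public static void main(String[] args) {
        // Two different inputs end up with the same filtered name.
        System.out.println(filter("report/2021.txt")); // report2021txt
        System.out.println(filter("report 2021.txt")); // report2021txt
    }
}
```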

For this reason I tend to allow the user to enter what they want, store the file under a name based on a scheme of my own choosing (e.g. userId_fileId) and then store the user's filename in a database table. That way you can display it back to the user, store things how you want, and you don't compromise security or wipe out other files.
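
A rough sketch of that scheme, assuming an in-memory map in place of the database table. The class StoredFileRegistry and its method names are hypothetical; only the userId_fileId naming comes from the text above.

```java
import java.util.HashMap;
import java.util.Map;

final class StoredFileRegistry {
    // Stand-in for the database table: storage name -> name the user supplied.
    private final Map<String, String> originalNames = new HashMap<>();
    private long nextFileId = 1;

    // Returns the name to use on disk; the user's name never reaches the filesystem.
    String register(long userId, String userSuppliedName) {
        String storageName = userId + "_" + nextFileId++;
        originalNames.put(storageName, userSuppliedName);
        return storageName;
    }

    // Looks up the name to display or to offer on download.
    String originalName(String storageName) {
        return originalNames.get(storageName);
    }
}
```

Because the storage name is built only from numbers chosen by the application, a hostile name such as "../../etc/passwd" is stored as data and never interpreted as a path.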

You can also hash the file (e.g. with MD5), but then you can't list the files the user put in (not with a meaningful name anyway).

EDIT: Fixed regex for Java

#3


34  

It depends on whether the encoding should be reversible or not.

Reversible

Use URL encoding (java.net.URLEncoder) to replace special characters with %xx. Note that you must take care of the special cases where the string equals ".", equals ".." or is empty!¹ Many programs use URL encoding to create file names, so this is a standard technique which everybody understands.

Irreversible

Use a hash (e.g. SHA-256) of the given string. Modern hash algorithms such as SHA-256 can be considered collision-free in practice; MD5 and SHA-1 cannot, because practical collision attacks on both have been published. Accidental collisions are still vanishingly unlikely with any of them, so the distinction matters mainly when the input may be crafted by an attacker.
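
A minimal sketch of the hashing approach using only the JDK. The choice of SHA-256, the UTF-8 encoding of the input and the class name HashedFilenames are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashedFilenames {
    // Returns the SHA-256 digest of s as 64 lowercase hex characters.
    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

The result contains only the characters 0-9 and a-f, so it is a valid file name on common filesystems, but the original string cannot be recovered from it.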


¹ You can handle all 3 special cases elegantly by using a prefix such as "myApp-". If you put the file directly into $HOME, you'll have to do that anyway to avoid conflicts with existing files such as ".bashrc".
public static String encodeFilename(String s)
{
    try
    {
        return "myApp-" + java.net.URLEncoder.encode(s, "UTF-8");
    }
    catch (java.io.UnsupportedEncodingException e)
    {
        throw new RuntimeException("UTF-8 is an unknown encoding!?");
    }
}
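
One caveat worth illustrating: URLEncoder leaves letters, digits, '.', '-', '_' and '*' unchanged and turns a space into '+', so a '*' survives encoding even though Windows does not allow it in file names. The sketch below is a variation on the method above, not a replacement for it. The class name UrlEncodedFilenames and the extra replace step are assumptions; URLDecoder reverses the extra step on its own because it decodes any %xx sequence.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class UrlEncodedFilenames {
    private static final String PREFIX = "myApp-";

    static String encode(String s) {
        try {
            // URLEncoder leaves '*' unchanged, so escape it explicitly.
            return PREFIX + URLEncoder.encode(s, "UTF-8").replace("*", "%2A");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    static String decode(String filename) {
        if (!filename.startsWith(PREFIX)) {
            throw new IllegalArgumentException("missing prefix: " + filename);
        }
        try {
            return URLDecoder.decode(filename.substring(PREFIX.length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Here encode("a/b c*") gives "myApp-a%2Fb+c%2A", and the inputs "..", "." and "" become "myApp-..", "myApp-." and "myApp-", none of which collides with the special directory entries.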

#4


13  

Here's what I use:

public String sanitizeFilename(String inputName) {
    return inputName.replaceAll("[^a-zA-Z0-9_.-]", "_");
}

This uses a regex to replace every character which is not a letter, digit, hyphen, underscore or dot with an underscore.

This means that something like "How to convert £ to $" will become "How_to_convert___to__". Admittedly, this result is not very user-friendly, but the resulting directory/file names contain only characters that are valid on all common filesystems. Note that names made up only of dots, such as "." and "..", pass through unchanged. In my case, the result is not shown to the user, and is thus not a problem, but you may want to alter the regex to be more permissive.

It's worth noting that another problem I encountered was that I would sometimes get identical names (since they're based on user input), so be aware of that: you can't have multiple directories/files with the same name in a single directory. Also, you may need to truncate or otherwise shorten the resulting string, since it may exceed the 255-character limit some systems have.
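
One possible way to deal with both problems is sketched below. The helper uniqueName, the numeric suffix scheme and the set of already-used names are all hypothetical, and the limit is counted in Java chars, whereas some filesystems count bytes.

```java
import java.util.Set;

final class UniqueNames {
    private static final int MAX_LENGTH = 255;

    // Returns a name no longer than MAX_LENGTH that is not in 'taken',
    // appending _1, _2, ... to the sanitized name when needed.
    static String uniqueName(String sanitized, Set<String> taken) {
        String base = sanitized.length() > MAX_LENGTH
                ? sanitized.substring(0, MAX_LENGTH) : sanitized;
        String candidate = base;
        for (int n = 1; taken.contains(candidate); n++) {
            String suffix = "_" + n;
            // Shorten the base so that base + suffix still fits.
            int keep = Math.min(base.length(), MAX_LENGTH - suffix.length());
            candidate = base.substring(0, keep) + suffix;
        }
        return candidate;
    }
}
```

The caller is expected to add the returned name to the set (or create the file) before asking for the next one; the sketch does not guard against two threads choosing the same name.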

#5


12  

For those looking for a general solution, these might be common criteria:

  • The filename should resemble the string.
  • The encoding should be reversible where possible.
  • The probability of collisions should be minimized.

To achieve this we can use regex to match illegal characters, percent-encode them, then constrain the length of the encoded string.

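import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern PATTERN = Pattern.compile("[^A-Za-z0-9_\\-]");

private static final int MAX_LENGTH = 127;

public static String escapeStringAsFilename(String in){

    StringBuffer sb = new StringBuffer();

    // Apply the regex.
    Matcher m = PATTERN.matcher(in);

    while (m.find()) {

        // Percent-encode every UTF-8 byte of the matched character as exactly
        // two hex digits, so that URLDecoder.decode(..., "UTF-8") can reverse it.
        StringBuilder replacement = new StringBuilder();
        for (byte b : m.group().getBytes(StandardCharsets.UTF_8)) {
            replacement.append('%').append(String.format("%02X", b & 0xFF));
        }

        m.appendReplacement(sb, replacement.toString());
    }
    m.appendTail(sb);

    String encoded = sb.toString();

    // Truncate the string.
    int end = Math.min(encoded.length(),MAX_LENGTH);
    return encoded.substring(0,end);
}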

Patterns

The pattern above is based on a conservative subset of allowed characters in the POSIX spec.

If you want to allow the dot character, use:

private static final Pattern PATTERN = Pattern.compile("[^A-Za-z0-9_\\-\\.]");

Just be wary of strings like "." and "..".

If you want to avoid collisions on case-insensitive filesystems, you'll need to escape capitals:

private static final Pattern PATTERN = Pattern.compile("[^a-z0-9_\\-]");

Or escape lower case letters:

private static final Pattern PATTERN = Pattern.compile("[^A-Z0-9_\\-]");

Rather than using a whitelist, you may choose to blacklist reserved characters for your specific filesystem. For example, this regex suits FAT32 filesystems:

private static final Pattern PATTERN = Pattern.compile("[%\\.\"\\*/:<>\\?\\\\\\|\\+,\\.;=\\[\\]]");

Length

On Android, 127 characters is the safe limit. Many filesystems allow 255 characters.

If you prefer to retain the tail, rather than the head of your string, use:

// Truncate the string.
int start = Math.max(0,encoded.length()-MAX_LENGTH);
return encoded.substring(start,encoded.length());

Decoding

To convert the filename back to the original string, use:

URLDecoder.decode(filename, "UTF-8");

Limitations

Because longer strings are truncated, there is the possibility of a name collision when encoding, or corruption when decoding.
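
One source of corruption is a cut that lands in the middle of a %XX escape. The sketch below shows a possible way to avoid that particular case when keeping the head of the string. The helper truncate is hypothetical, and it assumes that '%' itself is always escaped, as it is with the patterns above, so that every '%' in the encoded string starts an escape. Characters that were encoded as several %XX groups can still be split between groups.

```java
final class EscapeAwareTruncation {
    // Truncates a percent-encoded name to at most maxLength characters
    // without leaving a partial "%XX" escape at the end.
    static String truncate(String encoded, int maxLength) {
        if (encoded.length() <= maxLength) {
            return encoded;
        }
        int end = maxLength;
        // A literal '%' never appears unescaped, so any '%' starts an escape.
        if (end >= 1 && encoded.charAt(end - 1) == '%') {
            end -= 1; // cut fell right after the '%'
        } else if (end >= 2 && encoded.charAt(end - 2) == '%') {
            end -= 2; // cut fell after the first hex digit
        }
        return encoded.substring(0, end);
    }
}
```

This avoids that one form of decoding failure but does nothing about name collisions between strings that share the same head.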

#6


4  

Try using the following regex which replaces every invalid file name character with a space:

public static String toValidFileName(String input)
{
    return input.replaceAll("[:\\\\/*\"?|<>']", " ");
}

#7


4  

Pick your poison from the options presented by commons-codec, for example:

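// DigestUtils is org.apache.commons.codec.digest.DigestUtils; sha1Hex returns the digest as a hex string.
String safeFileName = DigestUtils.sha1Hex(filename);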

#8


1  

This is probably not the most efficient way, but it shows how to do it using Java 8 pipelines:

private static String sanitizeFileName(String name) {
    return name
            .chars()
            .mapToObj(i -> (char) i)
            .map(c -> Character.isWhitespace(c) ? '_' : c)
            .filter(c -> Character.isLetterOrDigit(c) || c == '-' || c == '_')
            .map(String::valueOf)
            .collect(Collectors.joining());
}

The solution could be improved by creating a custom collector which uses StringBuilder, so you do not have to convert each light-weight character to a heavy-weight string.
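
A sketch of that improvement, assuming the three-argument collect on IntStream is acceptable in place of a full Collector implementation. It also switches from chars() to codePoints(), which is a change from the code above, so that characters outside the Basic Multilingual Plane are tested as whole code points. The class name StreamSanitizer is hypothetical.

```java
final class StreamSanitizer {
    static String sanitizeFileName(String name) {
        return name
                .codePoints()
                // Turn whitespace into underscores.
                .map(c -> Character.isWhitespace(c) ? '_' : c)
                // Keep letters, digits, hyphens and underscores only.
                .filter(c -> Character.isLetterOrDigit(c) || c == '-' || c == '_')
                // Accumulate directly into a StringBuilder, with no per-character String.
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }
}
```

For example, sanitizeFileName("my file-1.txt") gives "my_file-1txt", the same result as the version above.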

#9


0  

You could remove the invalid characters ('/', '\', '?', '*') and then use the string.
