Is it a bad idea to escape HTML before inserting into the database instead of on output?

I've been working on a system which doesn't allow HTML formatting. The method I currently use is to escape HTML entities before they get inserted into the database. I've been told that I should insert the raw text into the database, and escape HTML entities on output.

The similar questions I've seen here cover cases where HTML can still be used for formatting, so I'm asking about a case where HTML wouldn't be used at all.

4 Solutions

#1

You also restrict yourself by escaping before inserting into your DB. Let's say you later decide to output not HTML but JSON, plain text, etc.

If you have stored escaped HTML in your DB, you would first have to unescape the stored value, just to re-escape it for the new format.

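As a rough illustration of that point, here is a minimal Python sketch (the thread's own snippet is PHP; Python's standard `html` and `json` modules stand in here). It assumes the database holds the raw text, and the function names `render_html`, `render_json`, and `render_plain` are hypothetical. Each output channel applies its own encoding at the last moment:

```python
import html
import json

# Raw text exactly as the user typed it; this is what would sit in the database.
raw_title = 'Tom & Jerry <3 "cartoons"'

def render_html(text: str) -> str:
    """Escape for an HTML context at output time."""
    return "<h1>" + html.escape(text, quote=True) + "</h1>"

def render_json(text: str) -> str:
    """JSON has its own escaping rules; HTML entities would be wrong here."""
    return json.dumps({"title": text})

def render_plain(text: str) -> str:
    """Plain text (e.g. an SMS or email body) needs no escaping at all."""
    return text

print(render_html(raw_title))
print(render_json(raw_title))
print(render_plain(raw_title))
```

Had the stored value already been HTML-escaped, the JSON and plain-text paths would each need an unescape step first.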

Also see this excellent OWASP article on XSS prevention.

#2

Yes, because at some stage you'll want access to the original input entered. This is because...

You never know how you will want to display it: in JSON, in HTML, as an SMS?

You may need to show it back to the user as is.

I do see your point about never wanting HTML entered. What are you using to strip HTML tags? If it is a regex, then look out for confused users who might type something like this...

3<4 :->

They'll only get the 3 if it is a regex.

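The effect can be reproduced with a short Python sketch. The pattern `<[^>]*>` is an assumption about what a naive tag stripper might look like, not something the question specifies; the point is only that escaping on output loses nothing, while stripping does:

```python
import html
import re

user_input = "3<4 :->"

# A naive tag stripper: removes anything that looks like <...>.
naive_tag_pattern = re.compile(r"<[^>]*>")
stripped = naive_tag_pattern.sub("", user_input)

# Escaping on output keeps every character and is still safe to embed in HTML.
escaped = html.escape(user_input)

print(repr(stripped))  # '3'
print(escaped)         # 3&lt;4 :-&gt;
```

The escaped form round-trips: a browser rendering it shows the user exactly what they typed.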

#3

Another elusive issue: suppose you are entering a record with the string R&B in its title. It will be stored as R&amp;B. Now assume we have a search function which uses this SQL:

```
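// Prefix search on the title column via a PDO prepared statement.
// `records` is a placeholder table name; `table` itself is a reserved word in SQL.
$query = $database->prepare('SELECT * FROM records WHERE title LIKE ?');
$query->execute(array($searchString . '%'));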
```
Now if someone searches for R&B, it won't match this row, because it is stored as R&amp;B. The same goes for equality comparisons, sorting, and so on.

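The mismatch is easy to reproduce with an in-memory SQLite database. This is a Python sketch rather than the PHP/PDO setup above, and the table names `escaped_records` and `raw_records` as well as the helper `prefix_search` are made up for the illustration:

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE escaped_records (title TEXT)")
conn.execute("CREATE TABLE raw_records (title TEXT)")

title = "R&B Classics"
# One table receives the HTML-escaped title, the other the raw text.
conn.execute("INSERT INTO escaped_records VALUES (?)", (html.escape(title),))
conn.execute("INSERT INTO raw_records VALUES (?)", (title,))

def prefix_search(table: str, search_string: str) -> list:
    # Table names cannot be bound as parameters; `table` only ever comes
    # from the two hard-coded names used in this sketch.
    rows = conn.execute(
        f"SELECT title FROM {table} WHERE title LIKE ?", (search_string + "%",)
    )
    return [row[0] for row in rows]

print(prefix_search("escaped_records", "R&B"))  # []
print(prefix_search("raw_records", "R&B"))      # ['R&B Classics']
```

Escaping the search term before querying would paper over this particular case, but every query path would then have to remember to do it.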

Of course, when HTML is allowed there is also the issue of not searching inside HTML tags, since a <span> will match when someone searches for span. This could be solved by delegating the search functionality to an external service like Solr, or by storing a second version of the text, cleared of HTML tags, special characters and such, in a separate field for full-text search, similar to what @limscoder suggested.

One day you may be exposing your data via an API or something similar, and your API users may assume it is unescaped.

A few months later, a new team member joins. As a well-trained developer, he always escapes HTML on output, only to find that everything is now double-escaped (e.g. titles show up as He said &quot;nuff&quot; instead of He said "nuff").

The quote style of htmlspecialchars() (e.g. ENT_QUOTES, ENT_COMPAT etc.) is going to bite you if you use anything other than the default and forget to use the same quote style both when storing and when outputting.

A similar issue arises when you use htmlentities() to store and htmlspecialchars() to output, or vice versa (along with the corresponding decoding functions). Your HTML will be polluted with &Uuml;s, &Ccedil;s etc.

These mistakes become more likely when multiple developers work on the same codebase.

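The double-escaping scenario from this answer can be sketched in a few lines of Python. Here `html.escape` stands in for PHP's htmlspecialchars(), and `html.unescape` is used only as an approximation of what a browser does when it displays the markup:

```python
import html

original = 'He said "nuff"'

stored = html.escape(original)   # escaped once, on the way into the database
rendered = html.escape(stored)   # escaped again at output by a careful colleague

# Approximates what the visitor actually sees in the browser.
shown_to_user = html.unescape(rendered)

print(stored)         # He said &quot;nuff&quot;
print(rendered)       # He said &amp;quot;nuff&amp;quot;
print(shown_to_user)  # He said &quot;nuff&quot;
```

If the raw text is stored instead, the single escape at output yields the expected result.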

#4

I usually store both versions of the text. The escaped/formatted text is used when a normal page request is made to avoid the overhead of escaping/formatting every time. The original/raw text is used when a user needs to edit an existing entry, and the escaping/formatting only occurs when the text is created or changed. This strategy works great unless you have tight storage space constraints, since you will be duplicating data.

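One way this two-column approach might look, as a Python sketch against an in-memory SQLite database. The schema, the `render` function, and the helper names are all hypothetical; the answer does not say how its author implements it:

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, raw_text TEXT, rendered_html TEXT)"
)

def render(raw_text: str) -> str:
    # Stand-in for whatever escaping/formatting the site applies.
    return html.escape(raw_text).replace("\n", "<br>")

def save_entry(entry_id: int, raw_text: str) -> None:
    # Rendering happens only on create or update, never on read.
    conn.execute(
        "INSERT OR REPLACE INTO entries (id, raw_text, rendered_html) VALUES (?, ?, ?)",
        (entry_id, raw_text, render(raw_text)),
    )

def text_for_display(entry_id: int) -> str:
    return conn.execute(
        "SELECT rendered_html FROM entries WHERE id = ?", (entry_id,)
    ).fetchone()[0]

def text_for_editing(entry_id: int) -> str:
    return conn.execute(
        "SELECT raw_text FROM entries WHERE id = ?", (entry_id,)
    ).fetchone()[0]

save_entry(1, "3 < 4\nR&B")
print(text_for_display(1))  # 3 &lt; 4<br>R&amp;B
print(text_for_editing(1))
```

Because the raw column stays authoritative, the rendered column can be regenerated at any time if the escaping or formatting rules change.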
