I've been working on a system which doesn't allow HTML formatting. The method I currently use is to escape HTML entities before they get inserted into the database. I've been told that I should insert the raw text into the database, and escape HTML entities on output.
Other similar questions I've seen here seem to cover cases where HTML can still be used for formatting, so I'm asking about a case where HTML wouldn't be used at all.
4 Answers
#1
14
You will also restrict yourself if you escape before inserting into your DB. Let's say you later decide to output not HTML but JSON, plaintext, etc.
If you have stored escaped HTML in your DB, you would first have to 'unescape' the stored value, just to re-escape it for a different format.
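To illustrate, here is a minimal sketch in PHP, assuming the text is stored raw and escaped only when it is rendered. The variable names and the sample string are made up for illustration.

```php
<?php
// Hypothetical value as it would be read back from the database, stored raw.
$raw = 'Tom & Jerry <3 "cartoons"';

// Escape for the target format only at output time.
$forHtml = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
$forJson = json_encode(['title' => $raw]);

// If the escaped form had been stored instead, every non-HTML output
// would first have to undo the HTML escaping before re-encoding.
$storedEscaped   = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
$recovered       = htmlspecialchars_decode($storedEscaped, ENT_QUOTES);
$jsonFromEscaped = json_encode(['title' => $recovered]);
```

With raw storage the JSON path is a single call; with escaped storage it needs an extra decode step whose flags must match whatever was used when storing.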
Also see this perfect OWASP article on XSS prevention.
#2
17
Yes, because at some stage you'll want access to the original input entered. This is because...
- You never know how you want to display it - in JSON, in HTML, as an SMS?
- You may need to show it back to the user as is.
I do see your point about never wanting HTML entered. What are you using to strip HTML tags? If it's a regex, then look out for confused users who might type something like this...
3<4 :->
They'll only get the 3 if it is a regex.
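A small sketch of what that looks like in PHP. The pattern below is only an assumed example of a naive tag-stripping regex, not necessarily the one in use:

```php
<?php
$input = '3<4 :->';

// A naive tag-stripping regex treats everything from "<" to the next ">" as a tag.
$stripped = preg_replace('/<[^>]*>/', '', $input);

// Escaping on output keeps the user's text intact.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $stripped, PHP_EOL; // 3
echo $escaped, PHP_EOL;  // 3&lt;4 :-&gt;
```

The escaped version renders in a browser exactly as the user typed it, while the stripped version silently loses most of the input.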
#3
4
- Another elusive issue: suppose you are entering a record with the string R&B in its title. It will be stored as R&amp;B. And assume we have a search function which uses this SQL:
$query = $database->prepare('SELECT * FROM table WHERE title LIKE ?');
$query->execute(array($searchString.'%'));
Now if someone searches for R&B, it won't match this row, as it is stored as R&amp;B. The situation is the same for equality, sorting etc.
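The mismatch can be reproduced with a short sketch. It assumes the pdo_sqlite extension is available and uses an in-memory database; the table name records and the sample titles are invented for the example.

```php
<?php
// Assumption: pdo_sqlite is installed. Table and titles are hypothetical.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE records (title TEXT)');

$insert = $db->prepare('INSERT INTO records (title) VALUES (?)');
// Escaped on input: this row is stored as "R&amp;B Classics".
$insert->execute([htmlspecialchars('R&B Classics', ENT_QUOTES, 'UTF-8')]);
// Stored raw.
$insert->execute(['R&B Hits']);

// Same prefix search as in the answer, with the user's search string.
$searchString = 'R&B';
$search = $db->prepare('SELECT title FROM records WHERE title LIKE ?');
$search->execute([$searchString . '%']);
$found = $search->fetchAll(PDO::FETCH_COLUMN);

print_r($found); // only the raw row is found
```

The escaped row could only be found by escaping the search string the same way before querying, which every query in the codebase would then have to remember to do.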
Of course, there is still the issue of excluding HTML tags from search, as <span>s will match when someone searches for span. This could be solved by delegating the search functionality to some external service like Solr, or by storing a version in a second field which is cleared of HTML tags, special characters and such (for full text search), similar to what @limscoder suggested.
- One day you may be exposing your data via an API or something, and your API users may assume it is un-escaped.
- A few months later, a new team member joins. As a well-trained developer, he always uses HTML escaping, only to now see everything double-escaped (e.g. titles show up like He said &quot;nuff&quot; instead of He said "nuff").
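A quick sketch of how the double escaping comes about, assuming both the storing code and the template call htmlspecialchars() with the same flags:

```php
<?php
$title = 'He said "nuff"';

// Escaped once on input, as the existing code does.
$stored = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

// Escaped again on output by the new team member's template.
$rendered = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');

echo $stored, PHP_EOL;   // He said &quot;nuff&quot;
echo $rendered, PHP_EOL; // He said &amp;quot;nuff&amp;quot;
```

A browser displays the second string as He said &quot;nuff&quot;, which is what ends up on the page.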
- The quote style of htmlspecialchars() (e.g. ENT_QUOTES, ENT_COMPAT etc.) is going to bite you if you use anything other than the default and forget to use the same quote style for both storing and outputting.
A similar issue happens when you use htmlentities() to store and htmlspecialchars() to output, or vice versa (with the corresponding counter-functions). Your HTML will be polluted with &Uuml;s, &Ccedil;s etc.
These mistakes are more likely when multiple developers work on the same codebase.
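Both mismatches can be shown in a few lines. This is only a sketch with made-up sample strings; flags are passed explicitly because the default flags of htmlspecialchars() changed in PHP 8.1 (from ENT_COMPAT to ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401).

```php
<?php
// 1. Quote style mismatch: stored with ENT_QUOTES, decoded with ENT_COMPAT.
$raw     = "Rock 'n' Roll";
$stored  = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');   // Rock &#039;n&#039; Roll
$decoded = htmlspecialchars_decode($stored, ENT_COMPAT);  // single quotes stay encoded

// 2. Function mismatch: htmlentities() to store, htmlspecialchars_decode() to read.
$name       = "\u{00DC}ber";                                      // U+00DC followed by "ber"
$storedName = htmlentities($name, ENT_QUOTES, 'UTF-8');           // &Uuml;ber
$wrong      = htmlspecialchars_decode($storedName, ENT_QUOTES);   // still &Uuml;ber
$right      = html_entity_decode($storedName, ENT_QUOTES, 'UTF-8');

echo $decoded, PHP_EOL;
echo $wrong, PHP_EOL;
```

In both cases the stored text can only be recovered if the reader knows exactly which function and flags the writer used, which is the coordination problem raw storage avoids.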
#4
3
I usually store both versions of the text. The escaped/formatted text is used when a normal page request is made to avoid the overhead of escaping/formatting every time. The original/raw text is used when a user needs to edit an existing entry, and the escaping/formatting only occurs when the text is created or changed. This strategy works great unless you have tight storage space constraints, since you will be duplicating data.
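One way this could look in PHP, as a rough sketch: the helper name buildEntryColumns() and the column names body_raw and body_html are hypothetical, and the "formatting" here is just HTML escaping.

```php
<?php
/**
 * Hypothetical helper: builds both column values for one entry.
 * Called only when the text is created or changed.
 */
function buildEntryColumns(string $raw): array
{
    return [
        'body_raw'  => $raw,                                        // shown in the edit form
        'body_html' => htmlspecialchars($raw, ENT_QUOTES, 'UTF-8'), // served on normal page views
    ];
}

$row = buildEntryColumns('R&B <live>');
print_r($row);
```

The raw column stays the source of truth, so the escaped column can be regenerated if the escaping rules ever change.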