Storing Base64-encoded data as BLOB or TEXT datatype

Date: 2022-09-15 19:15:18

We have a MySQL InnoDB table with ~10 columns holding small Base64-encoded JavaScript files and Base64-encoded PNG images (<2KB each).

There are few inserts and comparatively many reads; however, the output is cached on a Memcached instance for a few minutes to avoid repeated reads.

Right now we are using BLOB for those columns, but I am wondering whether switching to the TEXT datatype would bring any advantage in terms of performance or snapshot backups.

My searching indicates that BLOB and TEXT are close to identical for my case, and since I do not know beforehand what type of data will actually be stored, I went for BLOB.

Do you have any pointers on the TEXT vs BLOB debate for this specific case?

1 answer

#1


26  

One shouldn't store Base64-encoded data in one's database...

Base64 is a means of representing arbitrary binary data using only printable text characters: it was designed for situations where one needs to transfer such binary data across a protocol or medium that can handle only printable-text (e.g. SMTP/email). It increases the data size (by 33%) and adds the computational cost of encoding/decoding, so it should be avoided unless absolutely necessary.
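
To make the size overhead concrete, here is a minimal Python sketch. The payload is an arbitrary stand-in for a small file (an assumption, not data from the question), and the variable names are illustrative only.

```python
import base64

# Hypothetical payload standing in for a small (<2KB) binary file.
raw = bytes(range(256)) * 6          # 1536 bytes of arbitrary binary data

# Base64 maps every 3 input bytes to 4 output characters.
encoded = base64.b64encode(raw)

overhead = len(encoded) / len(raw) - 1
print(len(raw), len(encoded), f"{overhead:.1%}")

# Decoding recovers the original bytes, at the cost of extra CPU work.
decoded = base64.b64decode(encoded)
```

The encoded form is one third larger than the raw bytes, which matches the roughly 33% figure above.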

By contrast, the whole point of BLOB columns is that they store raw binary strings. So just go ahead and store your stuff directly into your BLOB columns without first Base64-encoding them. Usually you'll want to store related metadata in other columns, such as file version/last modified date, media type, and (in the case of text files, such as JavaScript sources) character encoding. You might decide to use TEXT type columns for the text files, not only so that MySQL will natively track character encoding for you, but also so that it can transcode to alternative character sets and/or inspect/manipulate the text as may be required (now or in the future).

The (erroneous) idea that SQL databases require printable-text encodings like Base64 for handling arbitrary binary data has been perpetuated by a large number of ill-informed tutorials. This idea appears to be seated in the mistaken belief that, because SQL comprises only printable-text in other contexts, it must surely require it for binary data too (at least for data transfer, if not for data storage). This is simply not true: SQL can convey binary data in a number of ways, including plain string literals (provided that they are properly quoted and escaped like any other string); of course, the preferred way to pass data (of any type) to your database is through parameterised queries, and parameters can just as easily contain binary data as they can anything else.
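
As a rough illustration of passing raw binary data through a parameterised query, the sketch below uses Python's built-in `sqlite3` module purely as a stand-in so that it runs without a server; it is not MySQL. The table name `assets`, its columns, and the sample bytes are all hypothetical. With a MySQL driver the same pattern should apply, although the placeholder syntax is typically `%s` rather than `?`.

```python
import sqlite3

# In-memory SQLite database used only as a runnable stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assets (name TEXT PRIMARY KEY, media_type TEXT, body BLOB)"
)

# Arbitrary binary payload, including bytes that are not printable text.
png_bytes = b"\x89PNG\r\n\x1a\n" + bytes(range(256))

# The bytes go in as a parameter: no Base64 and no manual escaping.
conn.execute(
    "INSERT INTO assets (name, media_type, body) VALUES (?, ?, ?)",
    ("logo.png", "image/png", png_bytes),
)

row = conn.execute(
    "SELECT body FROM assets WHERE name = ?", ("logo.png",)
).fetchone()
stored = row[0]
print(stored == png_bytes)
```

The metadata column (`media_type` here) reflects the suggestion above to keep such information alongside the raw bytes.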

For what it's worth, I usually altogether avoid storing items like this in the RDBMS and prefer instead to use those highly optimised file storage databases known as filesystems—but that's another matter altogether.

...unless it's cached for performance reasons...

The only situation in which there might be some benefit from storing Base64-encoded data is where data is frequently retrieved from the database and transmitted across a protocol that requires that encoding—in which case, storing the Base64-encoded representation would save from having to perform the encoding operation on the otherwise raw data upon every fetch.

However, note in this sense that the Base64-encoded storage is merely acting as a cache, much like one might store denormalised data for performance reasons.

...in which case it should be TEXT not BLOB

As alluded to above, the difference between TEXT and BLOB really comes down to the fact that TEXT columns are stored together with text-specific metadata (such as character encoding and collation), whereas BLOB columns are not. This additional metadata enables MySQL to transcode characters between storage and connection character sets (where appropriate) and perform fancy character equivalence/ordering.

Generally speaking: if two clients working in different character sets should see the same bytes, then you want a BLOB column; if they should see the same characters then you want a TEXT column.

With Base64, those two clients must ultimately find that the data decodes to the same bytes; but they should see that the encoded data has the same characters. For example, suppose one wishes to insert the Base64-encoding of 'Hello world!' (which is 'SGVsbG8gd29ybGQh'). If the inserting application is working in the UTF-8 character set, then it will send the byte sequence 0x53475673624738676432397962475168 to the database.

  • if that byte sequence is stored in a BLOB column and later retrieved by an application that is working in UTF-16, the same bytes will be returned—which represent '升噳扇㡧搲㥹扇全' and not the desired Base64-encoded value; whereas

  • if that byte sequence is stored in a TEXT column and later retrieved by an application that is working in UTF-16, MySQL will transcode on-the-fly to return the byte sequence 0x0053004700560073006200470038006700640032003900790062004700510068—which represents the original Base64-encoded value 'SGVsbG8gd29ybGQh' as desired.
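
The byte sequences in the two cases above can be reproduced with a short Python sketch. It only simulates the two outcomes on the client side and does not involve MySQL; it also assumes that "UTF-16" here means big-endian without a byte-order mark (`utf-16-be`), which is what the byte sequences shown imply.

```python
import base64

# The Base64 text from the example above.
b64_text = base64.b64encode(b"Hello world!").decode("ascii")

# What a UTF-8 client sends to the database.
utf8_bytes = b64_text.encode("utf-8")

# BLOB case: the same bytes come back and a UTF-16 client misreads them.
misread = utf8_bytes.decode("utf-16-be")

# TEXT case: the characters are preserved and re-encoded for the client.
transcoded = b64_text.encode("utf-16-be")

print(b64_text)
print(utf8_bytes.hex())
print(misread)
print(transcoded.hex())
```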

Of course, you could nevertheless use BLOB columns and track the character encoding in some other way—but that would just needlessly reinvent the wheel, with added maintenance complexity and risk of introducing unintentional errors.
