I've got a design decision to make and am looking for some best-practice advice. I have a Java program which needs to store a large number (a few hundred a day) of floating-point arrays in a MySQL database. The data is a fixed-length double array of length 300. I can see three reasonable options:
- Store the data as a BLOB.
- Serialize the data and store it as a VARCHAR.
- Write the data to disk as a binary file and store a reference to it instead.
I should also mention that this data will be read from and updated frequently.
I want to use a BLOB since that is what I have done in the past and it seems like the most efficient method (e.g., it maintains a fixed width and needs no conversion to a comma-separated string). However, my coworker is insisting that we should serialize and use VARCHAR, for reasons which seem mostly dogmatic.
If one of these methods is better than the other, are the reasons Java or MySQL specific?
4 Answers
#1
8
Store as a BLOB like so (see code example below). I think this is probably better than using Java serialization, since Java's built-in serialization will need 2427 bytes and non-Java applications will have a harder time dealing with the data. That only matters if non-Java applications might query the database in the future; if not, the built-in serialization is a few fewer lines.
import java.io.*;
import java.sql.*;

public class DoubleArrayBlob {

    // Encodes the array as big-endian doubles (8 bytes each, 2400 bytes for
    // 300 elements) and binds it to parameter 1. The caller prepares the
    // statement and calls executeUpdate().
    public static void storeInDB(PreparedStatement stmt, double[] dubs) throws IOException, SQLException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        DataOutputStream dout = new DataOutputStream(bout);
        for (double d : dubs) {
            dout.writeDouble(d);
        }
        dout.close();
        byte[] asBytes = bout.toByteArray();
        stmt.setBytes(1, asBytes);
    }

    // Decodes the "myDoubles" column of the next row, or returns null when
    // the result set has no more rows. The caller runs the query.
    public static double[] readFromDB(ResultSet rs) throws IOException, SQLException {
        if (!rs.next()) {
            return null;
        }
        double[] dubs = new double[300];
        byte[] asBytes = rs.getBytes("myDoubles");
        ByteArrayInputStream bin = new ByteArrayInputStream(asBytes);
        DataInputStream din = new DataInputStream(bin);
        for (int i = 0; i < dubs.length; i++) {
            dubs[i] = din.readDouble();
        }
        return dubs;
    }
}
Edit: I'd hoped to use BINARY(2400), but MySQL says:
mysql> create table t (a binary(2400)) ;
ERROR 1074 (42000): Column length too big for column 'a' (max = 255); use BLOB or TEXT instead
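The size figures quoted in this thread can be checked directly. The following is only a sketch; the class and method names are made up for illustration. It measures how many bytes a 300-element array occupies when written element by element with DataOutputStream, and when passed through Java's built-in serialization.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the original answer.
final class SerializedSizes {

    /** Bytes produced by writing each element with DataOutputStream.writeDouble. */
    static int rawSize(double[] values) {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        try (DataOutputStream dout = new DataOutputStream(bout)) {
            for (double d : values) {
                dout.writeDouble(d);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bout.size();
    }

    /** Bytes produced by ObjectOutputStream.writeObject for the given array. */
    static int javaSerializedSize(Object array) {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        try (ObjectOutputStream oout = new ObjectOutputStream(bout)) {
            oout.writeObject(array);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bout.size();
    }

    public static void main(String[] args) {
        System.out.println("raw double[300]:        " + rawSize(new double[300]));
        System.out.println("serialized double[300]: " + javaSerializedSize(new double[300]));
        System.out.println("serialized float[300]:  " + javaSerializedSize(new float[300]));
    }
}
```

The difference between the two encodings is a fixed stream and class header, so the space argument against built-in serialization is weak; the portability argument is the stronger one.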
#2
14
Is there a reason you don't create a child table so you can store one floating point value per row, instead of an array?
Say you store a thousand arrays of 300 elements each per day. That's 300,000 rows per day, or 109.5 million per year. Nothing to sneeze at, but within the capabilities of MySQL or any other RDBMS.
Re your comments:
Sure, if the order is significant you add another column for the order. Here's how I'd design the table:
CREATE TABLE VectorData (
    trial_id  INT NOT NULL,
    vector_no SMALLINT UNSIGNED NOT NULL,
    order_no  SMALLINT UNSIGNED NOT NULL,
    element   FLOAT NOT NULL,
    -- order_no must be part of the key, or each vector could hold only one element
    PRIMARY KEY (trial_id, vector_no, order_no),
    FOREIGN KEY (trial_id) REFERENCES Trials (trial_id)
);
- Total space for one vector stored this way (300 rows): 300 x (4+2+2+4) = 3600 bytes, plus an InnoDB record directory (internals stuff) of 16 bytes.
- Total space if you serialize a Java array of 300 floats = 1227 bytes?
So you save about 2400 bytes, or 67% of the space, by storing the array. But suppose you have 100GB of space to store the database. Storing a serialized array allows you to store 87.5 million vectors, whereas the normalized design only allows you to store 29.8 million vectors.
You said you store a few hundred vectors per day. Even at a thousand vectors per day, you'd fill up that 100GB partition in only 81 years instead of 239 years.
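The capacity arithmetic can be reproduced in a few lines. This is a rough sketch under stated assumptions: 100GB is read as 100 x 2^30 bytes, the per-vector sizes are the 1227 and 3600 bytes quoted in this answer, the fill rate is 1000 vectors per day, and index and page overhead are ignored. The class and method names are invented for illustration.

```java
// Hypothetical helper for the back-of-the-envelope estimate in this answer.
final class CapacityEstimate {

    /** Whole vectors that fit in the given capacity, ignoring index and page overhead. */
    static long vectorsThatFit(long capacityBytes, long bytesPerVector) {
        return capacityBytes / bytesPerVector;
    }

    /** Years needed to store that many vectors at a constant daily rate. */
    static double yearsToFill(long vectors, long vectorsPerDay) {
        return vectors / (vectorsPerDay * 365.0);
    }

    public static void main(String[] args) {
        long capacity = 100L << 30;      // assumption: 100GB means 100 * 2^30 bytes
        long perDay = 1000;              // assumption: a thousand vectors per day

        long serialized = vectorsThatFit(capacity, 1227);
        long normalized = vectorsThatFit(capacity, 3600);

        System.out.printf("serialized: %,d vectors, %.1f years%n",
                serialized, yearsToFill(serialized, perDay));
        System.out.printf("normalized: %,d vectors, %.1f years%n",
                normalized, yearsToFill(normalized, perDay));
    }
}
```

Under these assumptions the results land on roughly 87.5 million and 29.8 million vectors, which suggests the year figures in the answer were computed at a thousand vectors per day rather than a few hundred.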
Re your comment: Performance of INSERT is an important issue, but you're only storing a few hundred vectors per day.
Most MySQL applications can achieve hundreds or thousands of inserts per second without excessive wizardry.
If you need optimal performance, here are some things to look into:
- Explicit transactions
- Multi-row INSERT syntax
- INSERT DELAYED (if you still use MyISAM)
- LOAD DATA INFILE
- ALTER TABLE DISABLE KEYS, do the inserts, ALTER TABLE ENABLE KEYS
Search for the phrase "mysql inserts per second" on your favorite search engine to read many articles and blogs talking about this.
在你最喜欢的搜索引擎上搜索“mysql每秒插入”这句话,可以阅读很多关于这方面的文章和博客。
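As an illustration of the first two items in the list (explicit transactions and multi-row INSERT), here is a minimal JDBC sketch. It assumes the VectorData table from this answer with a key that includes order_no, and it takes an already open Connection; the class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper: writes one vector as a single multi-row INSERT
// inside one explicit transaction.
final class VectorInserter {

    private static final String PREFIX =
            "INSERT INTO VectorData (trial_id, vector_no, order_no, element) VALUES ";

    /** Builds "INSERT ... VALUES (?, ?, ?, ?), (?, ?, ?, ?), ..." with one group per row. */
    static String multiRowInsertSql(int rows) {
        if (rows < 1) {
            throw new IllegalArgumentException("rows must be >= 1");
        }
        StringBuilder sql = new StringBuilder(PREFIX);
        for (int i = 0; i < rows; i++) {
            sql.append(i == 0 ? "(?, ?, ?, ?)" : ", (?, ?, ?, ?)");
        }
        return sql.toString();
    }

    /** Inserts all elements of one vector; either every row is stored or none is. */
    static void insertVector(Connection conn, int trialId, int vectorNo, float[] elements)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);                 // start an explicit transaction
        try (PreparedStatement stmt =
                     conn.prepareStatement(multiRowInsertSql(elements.length))) {
            int param = 1;
            for (int i = 0; i < elements.length; i++) {
                stmt.setInt(param++, trialId);
                stmt.setInt(param++, vectorNo);
                stmt.setInt(param++, i);           // order_no keeps the array order
                stmt.setFloat(param++, elements[i]);
            }
            stmt.executeUpdate();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

A call such as VectorInserter.insertVector(conn, 1, 0, values) then costs one statement and one commit per vector instead of 300 of each.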
#3
3
If you just want to store the data as a binary dump of the Java array then, by all means, use a BLOB. Your friend may well be advising against this because some non-Java program may need to use the information at a later date, and binary dumps are probably a pain to interpret.
With serialization to a VARCHAR, you know the data format and can easily read it with any application.
通过对VARCHAR的序列化,您知道数据格式,并且可以轻松地使用任何应用程序读取它。
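For comparison, a text encoding suitable for a VARCHAR or TEXT column might look like the sketch below. The comma-separated layout is an assumption (the question mentions it only in passing), and the class name is made up. Double.toString and Double.parseDouble round-trip every double value exactly, so no precision is lost, but the text is usually longer than the 8 bytes per element of the binary form.

```java
import java.util.StringJoiner;

// Hypothetical codec for a comma-separated text column.
final class CsvVectorCodec {

    /** Joins the values with commas, using Double.toString for each element. */
    static String encode(double[] values) {
        StringJoiner joiner = new StringJoiner(",");
        for (double v : values) {
            joiner.add(Double.toString(v));
        }
        return joiner.toString();
    }

    /** Parses text produced by encode back into an array. */
    static double[] decode(String text) {
        if (text.isEmpty()) {
            return new double[0];
        }
        String[] parts = text.split(",");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        double[] sample = {1.5, -2.0, 0.1};
        String text = encode(sample);
        System.out.println(text + " (" + text.length() + " characters)");
        System.out.println(decode(text).length + " values decoded");
    }
}
```

Any language with a string split and a float parser can read this format, which is the portability argument for the text column; the cost is a variable-width value and a parse on every read.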
Of course, if there's even the slightest chance that you'll want to manipulate or report on the individual floats, they should be stored in a database-friendly format. In other words, not a binary dump, not serialized, not a CSV column.
Store them as Codd intended, in third normal form.
By the way, a few hundred 300-element floating point arrays each day is not a big database. Take it from someone who works on the mainframe with DB2: most DBMSs will easily handle that sort of volume. We collect tens of millions of rows every day into our application and it doesn't even break a sweat.
#4
0
Using a database to store a one-dimensional array is a pain in the ass! Even more so with an RDBMS when there is no relation between the stored data. Sorry, but the best solution IMHO is to use a file and just write the data the way you like, binary or as text. Then one array is 300 x the size of a long, or 300 lines of text.
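A file-based layout along these lines could be sketched as follows. The assumptions are mine: every vector has exactly 300 doubles, vectors are stored back to back as fixed-size 2400-byte records, and a vector is addressed by its index in the file. The class and method names are hypothetical. Because records have a fixed size, a frequently updated vector can be overwritten in place.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Path;

// Hypothetical fixed-size record file: record i starts at byte i * RECORD_BYTES.
final class VectorFile {

    static final int LENGTH = 300;
    static final int RECORD_BYTES = LENGTH * Double.BYTES;   // 2400 bytes

    /** Writes (or overwrites) the vector stored at the given record index. */
    static void writeVector(Path file, long index, double[] vector) {
        if (vector.length != LENGTH) {
            throw new IllegalArgumentException("expected " + LENGTH + " elements");
        }
        ByteBuffer buffer = ByteBuffer.allocate(RECORD_BYTES);   // big-endian by default
        buffer.asDoubleBuffer().put(vector);
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.seek(index * RECORD_BYTES);
            raf.write(buffer.array());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the vector stored at the given record index. */
    static double[] readVector(Path file, long index) {
        byte[] bytes = new byte[RECORD_BYTES];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(index * RECORD_BYTES);
            raf.readFully(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        double[] vector = new double[LENGTH];
        ByteBuffer.wrap(bytes).asDoubleBuffer().get(vector);
        return vector;
    }
}
```

The database row would then hold only the file path and record index. What this gives up is transactional consistency between the row and the file, which the application has to manage itself.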