I am working with large datasets (tens of millions of records, at times hundreds of millions) and want to use a database program that links well with R. I am trying to decide between MySQL and SQLite. The data is static, but there are a lot of queries that I need to do.
In this link to sqlite help, it states that:
"With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (241 bytes). And even if it could handle larger databases, SQLite stores the entire database in a single disk file and many filesystems limit the maximum size of files to something less than this. So if you are contemplating databases of this magnitude, you would do well to consider using a client/server database engine that spreads its content across multiple disk files, and perhaps across multiple volumes."
“默认页面大小为1024字节,SQLite数据库的大小限制为2 TB(241字节)。即使它可以处理更大的数据库,SQLite也会将整个数据库存储在一个磁盘文件中,并且许多文件系统限制了最大值文件的大小要小于这个。所以如果你正在考虑这么大的数据库,你最好考虑使用一个客户端/服务器数据库引擎,它将内容分布在多个磁盘文件中,也可能跨多个卷。“
I'm not sure what this means. When I have experimented with MySQL and SQLite, it seems that MySQL is faster, but I haven't constructed very rigorous speed tests. I'm wondering if MySQL is a better choice for me than SQLite due to the size of my dataset. The description above seems to suggest that this might be the case, but my data is nowhere near 2 TB.
There was a discussion on * that touched on this and referenced the same sqlite information page, but it didn't quite address this question.
I'd appreciate any insights into this filesystem constraint on maximum file size, and how it could affect speed for indexing tables and running queries. That would really help me decide which database to use for my analysis.
4 Answers
#1
6
The SQLite database engine stores the entire database in a single file. This may not be very efficient for incredibly large files (SQLite's limit is 2 TB, as you found in the help). In addition, SQLite is effectively limited to one writer at a time. If your application is web based or might end up being multi-threaded (like an AsyncTask on Android), MySQL is probably the way to go.
Personally, since you've done tests and MySQL is faster, I'd just go with MySQL. It will be more scalable going into the future and will allow you to do more.
#2
4
I'm not sure what this means. When I have experimented with mysql and sqlite, it seems that mysql is faster, but I haven't constructed very rigorous speed tests.
The short short version is:
- If your app needs to fit on a phone or some other embedded system, use SQLite. That's what it was designed for.
- If your app might ever need more than one concurrent connection, do not use SQLite. Use PostgreSQL, MySQL with InnoDB, etc.
#3
3
It seems that (in R, at least) SQLite is awesome for ad hoc analysis. With the RSQLite or sqldf packages it is really easy to load data and get started. But for data that you'll use over and over again, it seems to me that MySQL (or SQL Server) is the way to go, because it offers a lot more features for modifying your database (e.g., adding or changing keys).
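As a rough sketch of that workflow (the table and column names here are made up for illustration, and this assumes the DBI and RSQLite packages are installed):

```r
# Load a data frame into SQLite and query it via DBI/RSQLite.
library(DBI)

# ":memory:" keeps this example self-contained; use a file path
# such as "analysis.sqlite" to persist the database to disk.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical example data standing in for your real records
records <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))
dbWriteTable(con, "records", records, overwrite = TRUE)

# Index the column you filter on; with tens of millions of rows this matters
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_records_value ON records(value)")

result <- dbGetQuery(con, "SELECT id, value FROM records WHERE value > 25")
print(result)

dbDisconnect(con)
```

With sqldf, essentially the same query can be run directly against the data frame, e.g. `sqldf("SELECT * FROM records WHERE value > 25")`, without managing the connection yourself.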
#4
1
MySQL, if you are mainly using this as a web service. SQLite, if you want it to be able to function offline.
SQLite is generally much, much faster, as the majority (or all) of the data/indexes will be cached in memory. In my experience so far, even for millions of records (I have yet to reach hundreds of millions), it is far more effective than MySQL once you account for network latency and so on, provided the data is split up across multiple tables, or even multiple SQLite database files. That only holds, however, when the records are split across different tables and each query targets specific tables (rather than querying all of them).
An example would be an item database used in a simple game. While that may not sound like much, a UID was issued for every variation, so the generator quickly worked out to more than a million sets of 'stats' with variations. This was mainly workable because each 1000 sets of records was split among different tables (as we mainly pull records via their UID). Though the performance gain from splitting was never properly measured, we were getting queries that were easily 10 times faster than MySQL (mainly due to network latency).
Amusingly though, we ended up reducing the database to a few thousand entries, with item [pre-fix] / [suf-fix] determining the variations (like Diablo, only hidden), which proved to be much faster at the end of the day.
On a side note, though, my case was mainly one of queries being lined up one after another (each waiting for the one before it). If, however, you are able to make multiple connections/queries to the server at the same time, the performance drop in MySQL is more than compensated for on the client side, assuming the queries do not branch or interact with one another (e.g., if you got a result, query this, else that).
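The multiple-SQLite-files setup described above can be sketched with SQLite's ATTACH DATABASE command; the partition and table names here are invented for illustration, and in-memory databases stand in for the separate .sqlite files:

```r
# Partitioning records across separate SQLite databases and querying
# only the partition that holds the UID you need.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # main database

# ':memory:' keeps this self-contained; a file path (e.g. 'part1.sqlite')
# would attach a separate database file on disk instead.
dbExecute(con, "ATTACH DATABASE ':memory:' AS part1")
dbExecute(con, "ATTACH DATABASE ':memory:' AS part2")

dbExecute(con, "CREATE TABLE part1.items (uid INTEGER PRIMARY KEY, stat TEXT)")
dbExecute(con, "CREATE TABLE part2.items (uid INTEGER PRIMARY KEY, stat TEXT)")
dbExecute(con, "INSERT INTO part1.items VALUES (1, 'sword'), (2, 'shield')")
dbExecute(con, "INSERT INTO part2.items VALUES (1001, 'helm')")

# UIDs 1-1000 live in part1, so only that partition is touched
hit <- dbGetQuery(con, "SELECT stat FROM part1.items WHERE uid = 2")
print(hit)

dbDisconnect(con)
```

The point of the split is that a lookup by UID only opens and scans one small table, rather than one giant table (or a round trip to a remote server).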