I am looking for a database that can handle more than 500 million rows (create an index on a column in a reasonable time and return results for select queries in less than 3 seconds). Would PostgreSQL or MySQL on a low-end machine (Core 2 CPU 6600, 4 GB RAM, 64-bit system, Windows Vista) handle that many rows?
Update: To clarify, I am looking for information on which database I should use on a low-end machine to return results for select queries with one or two fields specified in the where clause. No joins. I need to create indexes -- it cannot take ages like on MySQL -- to achieve sufficient performance for my select queries. This machine is a test PC used to run an experiment.
The table schema:
create table mapper (
    `key`  VARCHAR(1000),   -- "key" is a reserved word in MySQL, hence the backticks
    attr1  VARCHAR(100),
    attr1b INT,             -- the original schema declares "attr1" twice; renamed here so the DDL parses
    attr2  INT,
    value  VARCHAR(2000),
    PRIMARY KEY (`key`),
    INDEX (attr1),
    INDEX (attr2)
);
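For illustration, the select queries in question are simple single-table lookups of roughly this shape (the literal values are just placeholders):

SELECT `key`, value FROM mapper WHERE attr1 = 'some-value';
SELECT `key`, value FROM mapper WHERE attr1 = 'some-value' AND attr2 = 42;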
9 Answers
#1
47
MSSQL can handle that many rows just fine. The query time is completely dependent on a lot more factors than just simple row count.
For example, it's going to depend on:
- how many joins those queries do
- how well your indexes are set up
- how much ram is in the machine
- speed and number of processors
- type and spindle speed of hard drives
- size of the row/amount of data returned in the query
- Network interface speed / latency
It's very easy to have a small (less than 10,000 rows) table that takes a couple of minutes to execute a query against. For example, use lots of joins, functions in the where clause, and zero indexes on an Atom processor with 512MB of total RAM. ;)
It takes a bit more work to make sure all of your indexes and foreign key relationships are good, that your queries are optimized to eliminate needless function calls, and that they only return the data you actually need. You'll also need fast hardware.
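As a hypothetical illustration of the "functions in the where clause" point, using the asker's mapper table: the first query below wraps the indexed column in a function and forces a scan, while the second (equivalent under a case-insensitive collation) lets the index on attr1 be used.

-- non-sargable: the function applied to the column defeats the index on attr1
SELECT * FROM mapper WHERE UPPER(attr1) = 'SOME-VALUE';

-- sargable rewrite: compare the column directly so the index can be used
SELECT * FROM mapper WHERE attr1 = 'some-value';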
It all boils down to how much money you want to spend, the quality of the dev team, and the size of the data rows you are dealing with.
UPDATE: updated due to the changes in the question.
The amount of information here is still not enough to give a real world answer. You are going to just have to test it and adjust your database design and hardware as necessary.
For example, I could very easily have 1 billion rows in a table on a machine with those specs, run a "select top(1) id from tableA (nolock)" query, and get an answer in milliseconds. By the same token, you can execute a "select * from tablea" query and it will take a while, because although the query executes quickly, transferring all of that data across the wire takes time.
Point is, you have to test. That means setting up the server, creating some of your tables, and populating them. Then you have to go through performance tuning to get your queries and indexes right. As part of that tuning you're going to uncover not only how the queries need to be restructured, but also exactly which parts of the machine might need to be replaced (i.e. disk, more RAM, CPU, etc.) based on the lock and wait types.
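A minimal sketch of what checking the wait types might look like, assuming SQL Server and its standard wait-stats DMV:

-- top waits since the last restart; the dominant wait types hint at whether
-- disk, CPU, or memory is the bottleneck
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;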
I'd highly recommend you hire (or contract) one or two DBAs to do this for you.
#2
22
Most databases can handle this; it's more about what you are going to do with this data and how you do it. Lots of RAM will help.
I would start with PostgreSQL. It's free, has no limit on RAM use (unlike SQL Server Express), and no potential problems with licences (too many processors, etc.). But it's also what I work on. :)
#3
9
Pretty much every non-stupid database can handle a billion rows today easily. 500 million is doable even on 32 bit systems (albeit 64 bit really helps).
The main problems are:
- You need to have enough RAM. How much is enough depends on your queries; a rough sense of scale is sketched after this list.
- You need to have a good enough disk subsystem. This pretty much means that if you want to do large selects, a single platter for everything is totally out of the question. Many spindles (or an SSD) are needed to handle the IO load.
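As a back-of-the-envelope illustration of the RAM point (the per-entry figure is a loose assumption, not a measurement): even a narrow secondary index over an INT column on this data set costs very roughly

500,000,000 rows x ~20 bytes per index entry ≈ 10 GB

which is already well beyond the 4 GB of RAM on the machine in the question, so such an index cannot be fully cached and the disk subsystem matters all the more.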
Both Postgres and MySQL can easily handle 500 million rows, on proper hardware.
#4
8
What you want to look at is the table size limit the database software imposes. For example, as of this writing, MySQL InnoDB has a limit of 64 TB per table, while PostgreSQL has a limit of 32 TB per table; neither limits the number of rows per table. If correctly configured, these database systems should not have trouble handling tens or hundreds of billions of rows (if each row is small enough), let alone 500 million rows.
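For a rough sense of scale against those limits (assuming single-byte characters and every row at its maximum declared size, which overstates things): each row of the mapper table above is at most about 1000 + 100 + 4 + 4 + 2000 ≈ 3.1 KB, so 500 million rows come to roughly 1.5 TB of raw row data, comfortably below either product's per-table ceiling.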
For best performance handling extremely large amounts of data, you should have sufficient disk space and good disk performance—which can be achieved with disks in an appropriate RAID—and large amounts of memory coupled with a fast processor(s) (ideally server-grade Intel Xeon or AMD Opteron processors). Needless to say, you'll also need to make sure your database system is configured for optimal performance and that your tables are indexed properly.
#5
5
The following article discusses the import and use of a 16 billion row table in Microsoft SQL. http://sqlmag.com/t-sql/adventures-big-data-how-import-16-billion-rows-single-table.
From the article:
Here are some distilled tips from my experience:
- The more data you have in a table with a defined clustered index, the slower it becomes to import unsorted records into it. At some point, it becomes too slow to be practical.
- If you want to export your table to the smallest possible file, make it native format. This works best with tables containing mostly numeric columns because they're more compactly represented in binary fields than character data. If all your data is alphanumeric, you won't gain much by exporting it in native format.
- Not allowing nulls in the numeric fields can further compact the data. If you allow a field to be nullable, the field's binary representation will contain a 1-byte prefix indicating how many bytes of data will follow.
- You can't use BCP for more than 2,147,483,647 records because the BCP counter variable is a 4-byte integer. I wasn't able to find any reference to this on MSDN or the Internet. If your table consists of more than 2,147,483,647 records, you'll have to export it in chunks or write your own export routine.
- Defining a clustered index on a prepopulated table takes a lot of disk space. In my test, my log exploded to 10 times the original table size before completion.
- When importing a large number of records using the BULK INSERT statement, include the BATCHSIZE parameter and specify how many records to commit at a time. If you don't include this parameter, your entire file is imported as a single transaction, which requires a lot of log space.
- The fastest way of getting data into a table with a clustered index is to presort the data first. You can then import it using the BULK INSERT statement with the ORDER parameter (see the sketch after this list).
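As a rough illustration of those last two tips (the table name, file path, and batch size below are made-up placeholders), a BULK INSERT along these lines commits in batches and tells SQL Server the file is already sorted on the clustering key:

BULK INSERT dbo.mapper
FROM 'D:\staging\mapper_sorted.dat'
WITH (
    DATAFILETYPE = 'native',  -- smallest file format, per the tips above
    BATCHSIZE = 1000000,      -- commit every 1M rows instead of one huge transaction
    ORDER ([key] ASC),        -- the file is presorted on the clustered index key
    TABLOCK
);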
Even that is small compared to the multi-petabyte Nasdaq OMX database, which houses tens of petabytes (thousands of terabytes) and trillions of rows on SQL Server.
#6
2
Have you checked out Cassandra? http://cassandra.apache.org/
#7
1
As mentioned, pretty much all DBs today can handle this situation; what you want to concentrate on is your disk I/O subsystem. You need to configure a RAID 0 or RAID 0+1 setup, throwing as many spindles at the problem as you can. Also, split your Log/Temp/Data onto separate logical drives for performance.
For example, let's say you have 12 drives. In your RAID controller I'd create 3 RAID 0 partitions of 4 drives each. In Windows (let's say) format each group as a logical drive (G, H, I); then, when configuring SQL Server (let's say), assign tempdb to G, the log files to H, and the data files to I.
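A rough sketch of that assignment in T-SQL (the database name, logical file names, and paths are placeholders; tempdb's default logical file names are assumed):

-- user database: data files on I:, log files on H:
CREATE DATABASE BigData
ON (NAME = BigData_data, FILENAME = 'I:\Data\BigData.mdf')
LOG ON (NAME = BigData_log, FILENAME = 'H:\Log\BigData.ldf');

-- move tempdb to G: (takes effect after the SQL Server service restarts)
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'G:\TempDB\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'G:\TempDB\templog.ldf');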
#8
1
I don't have much input on which is the best system to use, but perhaps this tip could help you get some of the speed you're looking for.
If you're going to be doing exact matches of long varchar strings, especially ones that are longer than allowed for an index, you can do a sort of pre-calculated hash:
CREATE TABLE BigStrings (
BigStringID int identity(1,1) NOT NULL PRIMARY KEY CLUSTERED,
Value varchar(6000) NOT NULL,
Chk AS (CHECKSUM(Value))
);
CREATE NONCLUSTERED INDEX IX_BigStrings_Chk ON BigStrings(Chk);
--Load 500 million rows in BigStrings
DECLARE @S varchar(6000);
SET @S = '6000-character-long string here';
-- nasty, slow table scan:
SELECT * FROM BigStrings WHERE Value = @S
-- super fast nonclustered seek followed by very fast clustered index range seek:
SELECT * FROM BigStrings WHERE Value = @S AND Chk = CHECKSUM(@S)
This won't help you if you aren't doing exact matches, but in that case you might look into full-text indexing. This will really change the speed of lookups on a 500-million-row table.
#9
1
"I need to create indices (that does not take ages like on mysql) to achieve sufficient performance for my select queries"
I'm not sure what you mean by "creating" indexes. That's normally a one-time thing. That said, when loading a huge amount of data, as you might be doing, it's typical to drop the indexes, load your data, and then add the indexes back, so the data load itself is very fast. After that, as you make changes to the database, the indexes are updated, but they don't need to be created each time your query runs.
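A sketch of that load pattern in MySQL terms, using the mapper table from the question (the index names and file path are hypothetical):

-- drop the secondary indexes so the bulk load doesn't maintain them row by row
ALTER TABLE mapper DROP INDEX idx_attr1, DROP INDEX idx_attr2;

-- bulk-load the data
LOAD DATA INFILE '/tmp/mapper.csv' INTO TABLE mapper;

-- rebuild the indexes once, after the load
ALTER TABLE mapper ADD INDEX idx_attr1 (attr1), ADD INDEX idx_attr2 (attr2);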
That said, databases do have query optimizers: they analyze your query, determine the best plan to retrieve the data, work out how to join the tables (not relevant in your scenario), and decide which available indexes to use. Obviously you'd want to avoid a full table scan, so performance tuning and reviewing the query plan are important, as others have already pointed out.
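Checking the plan is cheap; on MySQL or PostgreSQL, for instance, something like the following shows whether the index on attr1 is actually being used (the literal value is a placeholder):

EXPLAIN SELECT * FROM mapper WHERE attr1 = 'some-value';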
The point above about a checksum looks interesting, and that could even be an index on attr1 in the same table.