Storing large amounts of analytics data

I normally use SQL Server and C# for all my projects; however, I am looking at a project that could potentially grow to billions of rows of data, and I don't feel comfortable doing this in SQL Server.

The data I will be storing is

datetime
ipAddress
linkId
possibly other string-related data

I have only ever dealt with relational databases before, so I am looking for some guidance on which database technology would be best suited to this type of data storage: one that can scale, and do so at a low cost (when compared to sharding SQL Server).

I would then need to pull this data out based on linkId.

Also, would I be able to do ordering within the query to the DB, or would that be best done in the application?

EDIT: It will be cloud-based. Hence I was looking at SQL Azure, which I have used extensively; however, it starts causing issues as the row count goes up.

2 Solutions

#1

Given that this needs to be cloud-based and that you use .NET / C#, one option is Amazon's DynamoDB, provided you really are only talking about a few tables (so far just the stated one and the implied "Link" table, the source of LinkID) and hence might not need relationships or some of the other RDBMS features. DynamoDB is a NoSQL database that is part of AWS (Amazon Web Services). Its low-end free tier makes development, and even the initial stage of rolling out a project, a bit easier. As of 2013-11-04, the main DynamoDB page states that:

AWS Free Tier includes 100MB of Storage, 5 Units of Write Capacity, and 10 Units of Read Capacity with Amazon DynamoDB.

Here is some documentation: Overview, How to Query with .NET, and the general .NET SDK.

BE AWARE: When estimating how much it might cost, be sure to include related AWS charges, such as network usage.
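
For a rough feel of what the free-tier numbers quoted above mean for click-style data, here is a back-of-the-envelope sketch. It assumes one write capacity unit covers one write per second of an item up to 1 KB (my understanding of the DynamoDB documentation; verify against current AWS pricing pages) and a hypothetical item size of about 100 bytes for datetime + ipAddress + linkId. The function names are illustrative only and not part of any AWS SDK.

```python
import math

# Assumption: one write capacity unit = one write per second for an item up to 1 KB.
WRITE_UNIT_BYTES = 1024
SECONDS_PER_DAY = 86_400


def write_units_needed(writes_per_second, item_bytes):
    """Write capacity units required to sustain the given write rate."""
    units_per_write = math.ceil(item_bytes / WRITE_UNIT_BYTES)
    return math.ceil(writes_per_second * units_per_write)


def max_writes_per_day(write_units, item_bytes):
    """Upper bound on writes per day for a fixed number of write units."""
    units_per_write = math.ceil(item_bytes / WRITE_UNIT_BYTES)
    return (write_units // units_per_write) * SECONDS_PER_DAY


def storage_gib(row_count, item_bytes):
    """Raw data size in GiB, ignoring per-item and index overhead."""
    return row_count * item_bytes / 2**30


if __name__ == "__main__":
    item_bytes = 100  # hypothetical size of one click record
    print(max_writes_per_day(5, item_bytes))            # free tier: 5 write units
    print(write_units_needed(200, item_bytes))          # e.g. 200 clicks per second
    print(round(storage_gib(1_000_000_000, item_bytes), 1))  # one billion rows
```

Under these assumptions the free tier sustains at most a few hundred thousand writes per day, so it is suitable for development but not for a table heading towards billions of rows; provisioned capacity, storage, and network charges all need to go into the estimate.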

#2

Since you are looking for general guidance, I feel it is OK to offer an option that you have prematurely dismissed ;-). Microsoft SQL Server can definitely handle this situation (in the generic sense of having a table of those fields and billions of rows). I have personally worked on a Data Warehouse that had 4 nodes, each of which had a main fact table holding 1.2 to 1.5 billion rows (and growing), and it responded to queries quickly enough, despite some aspects of the data model and indexing that could have been done better. It is a web-based application with many users hitting it all day long (though some periods of the day are much heavier than others). Also, that fact table was much wider than the table you are describing, unless that "possibly other string related data" is rather large (but there are ways to properly model that as well). True, the free Express edition might not meet your needs, but Standard Edition likely would, and it is not super expensive. Enterprise Edition has a nice feature for doing online index rebuilds, but that alone might not warrant the huge jump in license fees.

Keep in mind that with little to no description of what you are actually trying to accomplish with this data, it is hard for me to say that MS SQL Server will definitely meet your needs. But, given that you seem to have ruled it out entirely on the basis of the large number of rows you might possibly get, I can at least speak to that situation: with good data modeling, good index design, and regular index maintenance, MS SQL Server can definitely handle billions of rows. Now, whether or not it is the best choice for your project depends on what you are trying to do, what the client is comfortable maintaining, etc.
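
To make "good index design" concrete for the access pattern in the question (pull rows by linkId, ordered by time), here is a minimal sketch. It uses Python's built-in SQLite purely as a stand-in so the example is runnable anywhere; it is not SQL Server, and the table name `link_hit`, the column names, and the index name are all hypothetical. The idea that carries over is a composite index with the equality column (linkId) first and the ordering column (datetime) second, so the database can both filter and return rows already sorted.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; DDL syntax and the planner differ,
# but the composite-index idea is the same.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE link_hit (
        link_id    INTEGER NOT NULL,
        hit_time   TEXT    NOT NULL,
        ip_address TEXT    NOT NULL
    )
    """
)
# Equality column first, ordering column second.
# hit_time is stored as ISO-8601 text, which sorts chronologically.
conn.execute("CREATE INDEX ix_link_hit_link_time ON link_hit (link_id, hit_time)")

conn.executemany(
    "INSERT INTO link_hit VALUES (?, ?, ?)",
    [
        (1, "2013-11-04T10:00:00", "10.0.0.1"),
        (2, "2013-11-04T09:30:00", "10.0.0.2"),
        (1, "2013-11-03T08:15:00", "10.0.0.3"),
        (1, "2013-11-04T11:45:00", "10.0.0.4"),
    ],
)

QUERY = "SELECT hit_time, ip_address FROM link_hit WHERE link_id = ? ORDER BY hit_time"


def hits_for_link(link_id):
    """Return (hit_time, ip_address) rows for one link, oldest first, sorted by the database."""
    return conn.execute(QUERY, (link_id,)).fetchall()


def query_plan(link_id):
    """Return the planner's description of how it would run the same query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (link_id,)).fetchall()
    return " ".join(row[-1] for row in rows)


if __name__ == "__main__":
    print(hits_for_link(1))
    print(query_plan(1))
```

With an index like this, ordering in the query is usually the better choice than sorting in the application, because the rows come back in index order and no separate sort step is needed. In SQL Server the equivalent would be an index (possibly the clustered index) on the same two columns, but that is a design decision to validate against the real workload.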

Good luck :)

EDIT:

When I said (above) that the queries came back "quickly enough", I meant anywhere from 1 to 90 seconds, depending on various factors. Keep in mind that these were not simple queries, and in my opinion, several improvements could be made to the data modeling and index strategy.

I intentionally left out the Table Partitioning feature not only because it is only in Enterprise Edition, but also because it is more often misunderstood and hence misused than understood and used properly. Table/Index partitioning in SQL Server is not a means of "sharding".
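
To illustrate the distinction: table partitioning splits one table into pieces inside a single database, whereas sharding spreads rows across separate servers and leaves the routing to the application. The following is only a sketch of what application-level shard routing by linkId could look like; the node names and the `shard_for_link` helper are hypothetical, and simple modulo routing like this forces data to be moved whenever the number of nodes changes.

```python
import hashlib

# Hypothetical node names; in practice these would map to connection strings.
SHARDS = ["node-0", "node-1", "node-2", "node-3"]


def shard_for_link(link_id, shards=SHARDS):
    """Pick the node that owns every row for this link_id.

    A stable hash (not Python's per-process hash()) keeps the mapping
    identical across processes and restarts.
    """
    digest = hashlib.sha256(str(link_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[bucket]


if __name__ == "__main__":
    for link_id in (1, 2, 3, 42):
        print(link_id, shard_for_link(link_id))
```

Because all rows for a given linkId land on the same node, a "pull by linkId" query only has to touch one server; queries that span many links would have to fan out to every node and merge the results in the application.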

I also did not mention Column Store indexes because they are only available in Enterprise Edition. However, for projects large enough to justify the cost, Column Store indexes are certainly worth investigating. They were introduced in SQL Server 2012 and came with the restriction that the table could not be updated once the Column Store index was created. You can get around that, to a degree, using Table Partitioning, but in SQL Server 2014 that restriction will be removed.
