How to get started with Big Data analysis

I've been a long-time user of R and have recently started working with Python. Having used conventional RDBMS systems for data warehousing and R/Python for number-crunching, I now feel the need to get my hands dirty with Big Data analysis.

I'd like to know how to get started with Big Data crunching:

- How to start simple with Map/Reduce and the use of Hadoop.
- How can I leverage my skills in R and Python to get started with Big Data analysis? Using the Python Disco project, for example.
- Using the RHIPE package and finding toy datasets and problem areas.
- Finding the right information to allow me to decide if I need to move to NoSQL from RDBMS-type databases.

All in all, I'd like to know how to start small and gradually build up my skills and know-how in Big Data Analysis.

Thank you for your suggestions and recommendations. I apologize for the generic nature of this query, but I'm looking to gain more perspective regarding this topic.

Harsh

2 answers

#1

Using the Python Disco project for example.

Good. Play with that.

Using the RHIPE package and finding toy datasets and problem areas.

Fine. Play with that, too.

Don't sweat finding "big" datasets. Even small datasets present very interesting problems. Indeed, any dataset is a starting-off point.
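
To make that concrete, here is a minimal sketch of the map/shuffle/reduce pattern in plain Python on a toy dataset. It does not use the Disco or Hadoop APIs. The records and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical toy dataset: (department, amount) records.
records = [
    ("ops", 120.0),
    ("it", 80.0),
    ("ops", 30.0),
    ("hr", 45.5),
    ("it", 20.0),
]

def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    department, amount = record
    yield department, amount

def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)

def run_job(data):
    """Run map, shuffle, and reduce in sequence on an in-memory dataset."""
    pairs = (pair for record in data for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

totals = run_job(records)
```

The same three phases are what a real framework distributes across machines, so getting comfortable with them on five rows carries over directly.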

I once built a small star-schema to analyze the $60M budget of an organization. The source data was in spreadsheets, and essentially incomprehensible. So I unloaded it into a star schema and wrote several analytical programs in Python to create simplified reports of the relevant numbers.
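
As a rough illustration of the idea (not the actual programs from that project), a star schema can be sketched in plain Python as one fact table whose rows hold only keys and a measure, plus small dimension tables that describe those keys. All table names, keys, and figures below are invented.

```python
from collections import defaultdict

# Dimension tables: surrogate key -> descriptive attributes (all hypothetical).
dim_department = {
    1: {"name": "Operations", "division": "Field"},
    2: {"name": "IT", "division": "Admin"},
    3: {"name": "HR", "division": "Admin"},
}
dim_period = {
    10: {"year": 2010, "quarter": "Q1"},
    11: {"year": 2010, "quarter": "Q2"},
}

# Fact table: one row per budget line, holding dimension keys and a measure.
fact_budget = [
    {"dept_key": 1, "period_key": 10, "amount": 500},
    {"dept_key": 1, "period_key": 11, "amount": 450},
    {"dept_key": 2, "period_key": 10, "amount": 200},
    {"dept_key": 2, "period_key": 11, "amount": 250},
    {"dept_key": 3, "period_key": 10, "amount": 100},
]

def rollup(facts, dimension, key_field, attribute):
    """Sum the 'amount' measure, grouped by one attribute of one dimension."""
    totals = defaultdict(int)
    for row in facts:
        label = dimension[row[key_field]][attribute]
        totals[label] += row["amount"]
    return dict(totals)

by_division = rollup(fact_budget, dim_department, "dept_key", "division")
by_quarter = rollup(fact_budget, dim_period, "period_key", "quarter")
```

Each simplified report is then just another call to `rollup` with a different dimension and attribute.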

Finding the right information to allow me to decide if I need to move to NoSQL from RDBMS type databases

This is easy.

First, get a book on data warehousing (Ralph Kimball's The Data Warehouse Toolkit, for example).

Second, study the "Star Schema" carefully, particularly all the variants and special cases that Kimball explains in depth.

Third, realize the following: SQL is for Updates and Transactions.

When doing "analytical" processing (big or small) there's almost no update of any kind. SQL (and related normalization) don't really matter much any more.

Kimball's point (and others', too) is that most of your data warehouse is not in SQL; it's in simple flat files. A data mart (for ad-hoc, slice-and-dice analysis) may be in a relational database to permit easy, flexible processing with SQL.

So the "decision" is trivial. If it's transactional ("OLTP") it must be in a Relational or OO DB. If it's analytical ("OLAP") it doesn't require SQL except for slice-and-dice analytics; and even then the DB is loaded from the official files as needed.
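
A minimal sketch of that last step, assuming the "official" data is a CSV flat file and using Python's built-in `sqlite3` as a throwaway data mart. The file contents, the `budget` table, and the `load_mart` helper are all hypothetical. The CSV is inlined as a string only to keep the example self-contained.

```python
import csv
import io
import sqlite3

# Stand-in for an official flat file; in practice this would be a file on disk.
FLAT_FILE = """department,quarter,amount
Operations,Q1,500
Operations,Q2,450
IT,Q1,200
IT,Q2,250
HR,Q1,100
"""

def load_mart(flat_text):
    """Load flat-file rows into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE budget (department TEXT, quarter TEXT, amount INTEGER)"
    )
    rows = csv.DictReader(io.StringIO(flat_text))
    # Named placeholders are filled from each row dict produced by DictReader.
    conn.executemany(
        "INSERT INTO budget VALUES (:department, :quarter, :amount)", rows
    )
    return conn

mart = load_mart(FLAT_FILE)

# Slice-and-dice with SQL; the mart can be rebuilt from the flat file at any time.
by_quarter = mart.execute(
    "SELECT quarter, SUM(amount) FROM budget GROUP BY quarter ORDER BY quarter"
).fetchall()
by_department = mart.execute(
    "SELECT department, SUM(amount) FROM budget GROUP BY department ORDER BY department"
).fetchall()
```

The database here is disposable: the flat file stays the source of record, and the mart is reloaded whenever a new slice is needed.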

#2

One thing you can consider is the DMelt (http://jwork.org/dmelt/) data analysis program. One notable feature is that it comes with hundreds of examples in the Python language, and a few books. The reason I used it is that it runs on my Windows 10 machine (since it uses the Java VM), and it has very good 2D/3D graphics that can be exported to vector graphics formats.