In what situations can I use Dask instead of Apache Spark?

Time: 2022-06-19 03:08:45

I am currently using Pandas and Spark for data analysis. I found that Dask provides parallelized NumPy arrays and Pandas DataFrames.

Pandas is easy and intuitive for data analysis in Python, but I have difficulty handling multiple large dataframes in Pandas because of limited system memory.

Simple Answer:

Apache Spark is an all-inclusive framework combining distributed computing, SQL queries, machine learning, and more that runs on the JVM and is commonly co-deployed with other Big Data frameworks like Hadoop. ... Generally Dask is smaller and lighter weight than Spark.

I learned the details below from http://dask.pydata.org/en/latest/spark.html

  • Dask is lightweight.

  • Dask is typically used on a single machine, but also runs well on a distributed cluster.

  • Dask provides parallel arrays, dataframes, machine learning, and custom algorithms.

  • Dask has an advantage for Python users because it is itself a Python library, so serialization and debugging when things go wrong happens more smoothly.

  • Dask gives up high-level understanding to allow users to express more complex parallel algorithms.

  • Dask is lighter weight and is easier to integrate into existing code and hardware.

  • If you want a single project that does everything and you’re already on Big Data hardware then Spark is a safe bet

  • Spark is typically used on small to medium sized cluster but also runs well on a single machine.
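
As a rough illustration of the "custom algorithms" point above, here is a minimal sketch using `dask.delayed` to build a small task graph from ordinary Python functions. The functions `load`, `clean`, and `total` are hypothetical stand-ins invented for this example; they are not part of Dask.

```python
from dask import delayed


@delayed
def load(i):
    # Hypothetical stand-in for reading one chunk of data.
    return list(range(i * 10, (i + 1) * 10))


@delayed
def clean(chunk):
    # Keep only the even values in a chunk.
    return [x for x in chunk if x % 2 == 0]


@delayed
def total(chunks):
    # Combine the cleaned chunks into one number.
    return sum(sum(c) for c in chunks)


# Nothing runs yet: each call only records a task in the graph.
cleaned = [clean(load(i)) for i in range(4)]
result = total(cleaned)

# compute() executes the graph; independent load/clean tasks may run in parallel.
answer = result.compute()
print(answer)
```

The four `load`/`clean` branches do not depend on each other, so the scheduler is free to run them concurrently before the final `total` step.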

I learned more about Dask from the link below: https://www.continuum.io/blog/developer-blog/high-performance-hadoop-anaconda-and-dask-your-cluster

  • If you’re running into memory issues, storage limitations, or CPU boundaries on a single machine when using Pandas, NumPy, or other computations with Python, Dask can help you scale up on all of the cores on a single machine, or scale out on all of the cores and memory across your cluster.

  • Dask works well on a single machine to make use of all of the cores on your laptop and process larger-than-memory data

  • Dask scales up resiliently and elastically on clusters with hundreds of nodes.

  • Dask works natively from Python with data in different formats and storage systems, including the Hadoop Distributed File System (HDFS) and Amazon S3. Anaconda and Dask can work with your existing enterprise Hadoop distribution, including Cloudera CDH and Hortonworks HDP.

http://dask.pydata.org/en/latest/dataframe-overview.html

Limitations

Dask.DataFrame does not implement the entire Pandas interface. Users expecting this will be disappointed. Notably, dask.dataframe has the following limitations:

  1. Setting a new index from an unsorted column is expensive

  2. Many operations, like groupby-apply and join on unsorted columns, require setting the index, which, as mentioned above, is expensive

  3. The Pandas API is very large. Dask.dataframe does not attempt to implement many pandas features or any of the more exotic data structures like NDFrames

Thanks to the Dask developers. It seems like a very promising technology.

Overall, I understand that Dask is simpler to use than Spark. Dask is as flexible as Pandas, with the added ability to compute in parallel across more CPUs.

I understand all the above facts about Dask.

So, roughly how much data (in terabytes) can be processed with Dask?

1 Answer

#1


15  

You may want to read the Dask comparison to Apache Spark.

Apache Spark is an all-inclusive framework combining distributed computing, SQL queries, machine learning, and more that runs on the JVM and is commonly co-deployed with other Big Data frameworks like Hadoop. It was originally optimized for bulk data ingest and querying common in data engineering and business analytics but has since broadened out. Spark is typically used on small to medium sized cluster but also runs well on a single machine.

Dask is a parallel programming library that combines with the Numeric Python ecosystem to provide parallel arrays, dataframes, machine learning, and custom algorithms. It is based on Python and the foundational C/Fortran stack. Dask was originally designed to complement other libraries with parallelism, particularly for numeric computing and advanced analytics, but has since broadened out. Dask is typically used on a single machine, but also runs well on a distributed cluster.

Generally Dask is smaller and lighter weight than Spark. This means that it has fewer features and instead is intended to be used in conjunction with other libraries, particularly those in the numeric Python ecosystem.
