Apache Spark 2.0.0 Released with API Updates

2016-7-28 22:33 | Posted by: joejoe0332 | Original author: oschina | Source: oschina

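Apache Spark 2.0.0 has been released. Apache Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in ways that let Spark perform better on certain workloads: Spark provides in-memory distributed datasets, which support interactive queries and also speed up iterative workloads.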

This release mainly updates the APIs, adds SQL 2003 support and R UDF support, and improves performance. 300 developers contributed 2,500 patches.

The API updates in Apache Spark 2.0.0 are as follows:

Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.
SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.
A new, streamlined configuration API for SparkSession
Simpler, more performant accumulator API
A new, improved Aggregator API for typed aggregation in Datasets
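
The typed Aggregator API listed above is organized around four operations: a zero value, a per-record reduce, a merge of partial buffers, and a finish step. The following is a plain-Python sketch of that contract for an average, assuming nothing about Spark itself: it does not import Spark, and the names `AverageAggregator` and `run_partitions` are hypothetical, chosen only for illustration.

```python
from functools import reduce
from typing import Iterable, List, Tuple

Buffer = Tuple[float, int]  # (running sum, record count)


class AverageAggregator:
    """Hypothetical aggregator following a zero/reduce/merge/finish contract."""

    def zero(self) -> Buffer:
        # Identity buffer: merging it with any buffer leaves that buffer unchanged.
        return (0.0, 0)

    def reduce(self, buf: Buffer, value: float) -> Buffer:
        # Fold one input record into a partial buffer.
        return (buf[0] + value, buf[1] + 1)

    def merge(self, a: Buffer, b: Buffer) -> Buffer:
        # Combine two partial buffers, e.g. from different partitions.
        return (a[0] + b[0], a[1] + b[1])

    def finish(self, buf: Buffer) -> float:
        # Turn the final buffer into the output value.
        total, count = buf
        return total / count if count else float("nan")


def run_partitions(agg: AverageAggregator, partitions: List[Iterable[float]]) -> float:
    # Reduce each partition on its own, then merge the partial results.
    partials = [reduce(agg.reduce, part, agg.zero()) for part in partitions]
    merged = reduce(agg.merge, partials, agg.zero())
    return agg.finish(merged)


result = run_partitions(AverageAggregator(), [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])
print(result)
```

Keeping the buffer as a (sum, count) pair rather than a running average is what makes the merge step associative, so partitions can be combined in any order.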

The SQL updates in Apache Spark 2.0.0 are as follows:

A native SQL parser that supports both ANSI SQL and Hive QL
Native DDL command implementations
Subquery support, including:

Uncorrelated scalar subqueries
Correlated scalar subqueries
NOT IN predicate subqueries (in WHERE/HAVING clauses)
IN predicate subqueries (in WHERE/HAVING clauses)
(NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)

View canonicalization support
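
The subquery forms listed above are standard SQL, so their shapes can be illustrated without a cluster. The sketch below runs one query of each kind on Python's built-in `sqlite3` module, not on Spark; the `dept` and `emp` tables and their rows are hypothetical sample data, and no claim is made here about how Spark plans or restricts these queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER, name TEXT);
    CREATE TABLE emp (name TEXT, dept_id INTEGER, salary INTEGER);
    INSERT INTO dept VALUES (1, 'eng'), (2, 'ops'), (3, 'empty');
    INSERT INTO emp VALUES ('ann', 1, 120), ('bob', 1, 100), ('cat', 2, 90);
""")


def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Uncorrelated scalar subquery: the inner query does not reference the outer row.
above_avg = rows(
    "SELECT name FROM emp "
    "WHERE salary > (SELECT AVG(salary) FROM emp) ORDER BY name"
)

# Correlated scalar subquery: the inner query references the outer row (e.dept_id).
top_in_dept = rows(
    "SELECT e.name FROM emp e "
    "WHERE e.salary = (SELECT MAX(i.salary) FROM emp i WHERE i.dept_id = e.dept_id) "
    "ORDER BY e.name"
)

# IN and NOT IN predicate subqueries in a WHERE clause.
staffed = rows("SELECT name FROM dept WHERE id IN (SELECT dept_id FROM emp) ORDER BY name")
unstaffed = rows("SELECT name FROM dept WHERE id NOT IN (SELECT dept_id FROM emp)")

# EXISTS and NOT EXISTS predicate subqueries in a WHERE clause.
has_emp = rows(
    "SELECT d.name FROM dept d "
    "WHERE EXISTS (SELECT 1 FROM emp e WHERE e.dept_id = d.id) ORDER BY d.name"
)
no_emp = rows(
    "SELECT d.name FROM dept d "
    "WHERE NOT EXISTS (SELECT 1 FROM emp e WHERE e.dept_id = d.id)"
)
```

With this sample data, the NOT IN and NOT EXISTS queries return the same department because `dept_id` contains no NULLs; with NULLs present the two forms can differ.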

Some new features:

Native CSV data source, based on Databricks’ spark-csv module
Off-heap memory management for both caching and runtime execution
Hive style bucketing support
Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.
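
Two of the sketches named above, the Bloom filter and the count-min sketch, can be illustrated in a few lines. This is a minimal standard-library sketch of the general techniques, not Spark's implementation; the class names, the default sizes, and the salted SHA-256 hashing scheme are all assumptions made for the example.

```python
import hashlib


def _hashes(item: str, k: int, width: int):
    """Yield k bucket indexes in [0, width) derived from salted SHA-256 digests."""
    for seed in range(k):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % width


class BloomFilter:
    """Approximate set membership: false positives possible, no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity over compactness

    def add(self, item: str) -> None:
        for idx in _hashes(item, self.num_hashes, self.num_bits):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[idx] for idx in _hashes(item, self.num_hashes, self.num_bits))


class CountMinSketch:
    """Approximate frequency counts: estimates may overcount but never undercount."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item: str, count: int = 1) -> None:
        # Increment one counter per row, each chosen by a different hash.
        for row, idx in enumerate(_hashes(item, self.depth, self.width)):
            self.table[row][idx] += count

    def estimate(self, item: str) -> int:
        # The smallest counter is the one least inflated by collisions.
        return min(
            self.table[row][idx]
            for row, idx in enumerate(_hashes(item, self.depth, self.width))
        )


words = ["spark"] * 5 + ["hadoop"] * 2 + ["mesos"]
bloom = BloomFilter()
cms = CountMinSketch()
for word in words:
    bloom.add(word)
    cms.add(word)
```

Both structures use fixed memory regardless of how many items are added, which is the trade-off they make in exchange for approximate answers.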

Performance enhancements:

Substantial (2-10x) performance speedups for common operators in SQL and DataFrames via a new technique called whole-stage code generation.
Improved Parquet scan throughput through vectorization
Improved ORC performance
Many improvements in the Catalyst query optimizer for common workloads
Improved window function performance via native implementations for all window functions
Automatic file coalescing for native data sources
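
Whole-stage code generation, mentioned in the list above, is commonly described as fusing a chain of operators into a single tight loop instead of passing rows one at a time between separate operator iterators. The following conceptual Python sketch contrasts the two styles on a toy query (sum of squares of even numbers). It is not Spark's generated code, all function names are hypothetical, and it is not meant to demonstrate the quoted speedups.

```python
from typing import Iterable, Iterator


# Iterator-style pipeline: each operator pulls rows from its child operator.
def scan(data: Iterable[int]) -> Iterator[int]:
    for row in data:
        yield row


def filter_even(child: Iterator[int]) -> Iterator[int]:
    for row in child:
        if row % 2 == 0:
            yield row


def project_square(child: Iterator[int]) -> Iterator[int]:
    for row in child:
        yield row * row


def sum_iterator_style(data: Iterable[int]) -> int:
    total = 0
    for row in project_square(filter_even(scan(data))):
        total += row
    return total


# Fused style: scan, filter, projection, and aggregation collapsed into one loop,
# so no row is handed between operators.
def sum_fused(data: Iterable[int]) -> int:
    total = 0
    for row in data:
        if row % 2 == 0:
            total += row * row
    return total


data = range(10)
iterator_result = sum_iterator_style(data)
fused_result = sum_fused(data)
```

Both functions compute the same answer; the difference is only in how much per-row hand-off happens between operators.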
