Spark Learning Examples (Python): Converting Between RDD, DataFrame, and DataSet
Before learning the conversions, let's first go over the basic concepts:

- RDD: a resilient distributed dataset, a read-only partitioned collection of records.
- DataFrame: a distributed collection of data organized into named columns, conceptually the same as a table in a relational database.
- DataSet: a distributed collection of data; not yet supported in Python.

With the concepts covered, let's create these datasets in code (DataSet is omitted below since Python does not support it).
Creating an RDD
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("rddData") \
        .master("local[*]") \
        .getOrCreate()
    # Method 1: from an in-memory collection
    data = [1, 2, 3, 4, 5]
    rdd1 = spark.sparkContext.parallelize(data)
    print(rdd1.collect())
    # [1, 2, 3, 4, 5]
    # Method 2: from a text file
    rdd2 = spark.sparkContext.textFile("/home/llh/data/people.txt")
    print(rdd2.collect())
    # ['Jack 27', 'Rose 24', 'Andy 32']
    spark.stop()
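As the concepts section notes, an RDD is a partitioned collection, so it can help to see how the elements are actually split up. A minimal sketch, assuming the same local SparkSession setup as above; the partition count 3 is an arbitrary illustration:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("rddPartitions") \
        .master("local[*]") \
        .getOrCreate()
    # The second argument to parallelize sets the number of partitions
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5], 3)
    print(rdd.getNumPartitions())
    # 3
    # glom() groups the elements of each partition into a list
    print(rdd.glom().collect())
    # [[1], [2, 3], [4, 5]]
    spark.stop()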
Creating a DataFrame
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("rddDataFrame") \
        .master("local[*]") \
        .getOrCreate()
    df = spark.read.text("/home/llh/data/people.txt")
    df.show()
    # +-------+
    # |  value|
    # +-------+
    # |Jack 27|
    # |Rose 24|
    # |Andy 32|
    # +-------+
    spark.stop()
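Note that spark.read.text always produces a single string column named value. To get proper named columns such as name and age, the lines have to be parsed and paired with a schema. A minimal sketch, assuming the same space-separated people.txt as above:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("rddDataFrameSchema") \
        .master("local[*]") \
        .getOrCreate()
    # Parse each "Name age" line into a (name, age) tuple
    rdd = spark.sparkContext.textFile("/home/llh/data/people.txt") \
        .map(lambda line: line.split(" ")) \
        .map(lambda parts: (parts[0], int(parts[1])))
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)
    ])
    df = spark.createDataFrame(rdd, schema)
    df.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    spark.stop()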
Converting an RDD to a DataFrame
from pyspark.sql import SparkSession
from pyspark.sql import Row

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("rddRDD") \
        .master("local[*]") \
        .getOrCreate()
    data = [1, 2, 3]
    rdd1 = spark.sparkContext.parallelize(data)
    print(rdd1.collect())
    # [1, 2, 3]
    # rdd -> dataframe
    rdd2 = rdd1.map(lambda x: Row(x))
    df = spark.createDataFrame(rdd2, schema=["num"])
    df.show()
    # +---+
    # |num|
    # +---+
    # |  1|
    # |  2|
    # |  3|
    # +---+
    spark.stop()
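createDataFrame is not the only route: once a SparkSession exists, an RDD of tuples can also be converted with the toDF() shortcut, passing the column names directly. A minimal sketch, assuming an active SparkSession named spark as in the examples above:

# Assumes a SparkSession named spark is already active
rdd = spark.sparkContext.parallelize([("Jack", 27), ("Rose", 24), ("Andy", 32)])
# toDF() takes the column names and infers the types from the data
df = rdd.toDF(["name", "age"])
df.show()
# +----+---+
# |name|age|
# +----+---+
# |Jack| 27|
# |Rose| 24|
# |Andy| 32|
# +----+---+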
Converting a DataFrame to an RDD
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("rddDataFrame") \
        .master("local[*]") \
        .getOrCreate()
    df = spark.read.text("/home/llh/data/people.txt")
    rdd = df.rdd
    print(rdd.collect())
    # [Row(value='Jack 27'), Row(value='Rose 24'), Row(value='Andy 32')]
    spark.stop()
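The resulting RDD contains Row objects rather than plain strings; mapping over it recovers the raw values. A short sketch continuing from the df above:

# Each element is a Row with a single "value" field
rdd_values = df.rdd.map(lambda row: row.value)
print(rdd_values.collect())
# ['Jack 27', 'Rose 24', 'Andy 32']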
That covers how RDDs and DataFrames are created and converted to each other.
