Spark Learning Examples (Python): Converting Between RDD, DataFrame, and DataSet
Before diving into the conversions, let's first go over the basic concepts.
- RDD: Resilient Distributed Dataset, a read-only, partitioned collection of records.
- DataFrame: a distributed dataset organized into named columns, conceptually equivalent to a table in a relational database.
- DataSet: a distributed collection of data; not currently supported in Python.
With the basic concepts in place, let's write the code that creates each of these datasets.
Creating an RDD
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("rddData") \
        .master("local[*]") \
        .getOrCreate()
    # Option 1: parallelize an in-memory collection
    data = [1, 2, 3, 4, 5]
    rdd1 = spark.sparkContext.parallelize(data)
    print(rdd1.collect())  # [1, 2, 3, 4, 5]
    # Option 2: load an external text file
    rdd2 = spark.sparkContext.textFile("/home/llh/data/people.txt")
    print(rdd2.collect())  # ['Jack 27', 'Rose 24', 'Andy 32']
    spark.stop()
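Once an RDD exists, work gets done through transformations like map and filter. Here is a minimal sketch, assuming each line of people.txt looks like "Jack 27" (the appName "rddOps" and the age cutoff of 25 are just illustrative):

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("rddOps").master("local[*]").getOrCreate()
    rdd = spark.sparkContext.textFile("/home/llh/data/people.txt")
    # Split each "name age" line and convert the age to an int
    pairs = rdd.map(lambda line: line.split(" ")) \
               .map(lambda parts: (parts[0], int(parts[1])))
    # Keep only people aged 25 and up
    print(pairs.filter(lambda p: p[1] >= 25).collect())  # [('Jack', 27), ('Andy', 32)]
    spark.stop()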
Creating a DataFrame
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("rddDataFrame") \
        .master("local[*]") \
        .getOrCreate()
    # read.text loads each line of the file into a single "value" column
    df = spark.read.text("/home/llh/data/people.txt")
    df.show()
    # +-------+
    # |  value|
    # +-------+
    # |Jack 27|
    # |Rose 24|
    # |Andy 32|
    # +-------+
    spark.stop()
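Note that spark.read.text only ever produces one value column, so getting a proper name/age table means supplying a schema. A minimal sketch, assuming people.txt is space-separated with name first (the column names here are my own choice):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":
    spark = SparkSession.builder.appName("csvDataFrame").master("local[*]").getOrCreate()
    # Assumes each line of people.txt is "name age", e.g. "Jack 27"
    schema = StructType([
        StructField("name", StringType()),
        StructField("age", IntegerType()),
    ])
    df = spark.read.csv("/home/llh/data/people.txt", sep=" ", schema=schema)
    df.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    spark.stop()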
Converting an RDD to a DataFrame
from pyspark.sql import SparkSession
from pyspark.sql import Row

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("rddRDD") \
        .master("local[*]") \
        .getOrCreate()
    data = [1, 2, 3]
    rdd1 = spark.sparkContext.parallelize(data)
    print(rdd1.collect())  # [1, 2, 3]
    # rdd -> dataframe: wrap each element in a Row, then name the column
    rdd2 = rdd1.map(lambda x: Row(x))
    df = spark.createDataFrame(rdd2, schema=["num"])
    df.show()
    # +---+
    # |num|
    # +---+
    # |  1|
    # |  2|
    # |  3|
    # +---+
    spark.stop()
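The same conversion can also go through toDF, which is often more convenient when the RDD already holds tuples. A minimal sketch (the appName and the name/age column names are my own choice):

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("rddToDF").master("local[*]").getOrCreate()
    # An RDD of tuples converts directly; toDF takes the column names
    rdd = spark.sparkContext.parallelize([("Jack", 27), ("Rose", 24), ("Andy", 32)])
    df = rdd.toDF(["name", "age"])
    df.show()
    # +----+---+
    # |name|age|
    # +----+---+
    # |Jack| 27|
    # |Rose| 24|
    # |Andy| 32|
    # +----+---+
    spark.stop()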
Converting a DataFrame to an RDD
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("rddDataFrame") \
        .master("local[*]") \
        .getOrCreate()
    df = spark.read.text("/home/llh/data/people.txt")
    rdd = df.rdd
    print(rdd.collect())
    # [Row(value='Jack 27'), Row(value='Rose 24'), Row(value='Andy 32')]
    spark.stop()
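As the output shows, df.rdd hands back Row objects, so one more map is usually needed to recover plain Python values. A minimal sketch (the appName is illustrative):

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("dfToRdd").master("local[*]").getOrCreate()
    df = spark.read.text("/home/llh/data/people.txt")
    # Each element of df.rdd is a Row; pull out its single "value" field
    lines = df.rdd.map(lambda row: row.value)
    print(lines.collect())  # ['Jack 27', 'Rose 24', 'Andy 32']
    spark.stop()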
That covers creating RDDs and DataFrames and converting between them.