Reading and saving CSV files with Spark, with a specified encoding
1. Reading a CSV file with Spark
Core code:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("app")
  .getOrCreate()

// Read the file
// Approach 1:
val srcDF = spark
  .read
  .format("csv")
  .option("header", "true")
  .option("multiLine", "true")
  .option("encoding", "gbk") // or utf-8
  .load("file:///C:/1.csv")

// Approach 2:
val df = spark
  .read
  .option("header", "true")
  .option("multiLine", "true")
  .option("encoding", "gbk") // or utf-8
  .csv("/user/hadoop/test.csv")

spark.stop()
```
Key options:
format: specifies the CSV data source.
header: whether to treat the first row as the schema (column names).
multiLine: a cell may contain embedded line breaks when its text is long; without this option, parsing such data can fail or split one record across rows. Setting it to true merges a multi-line cell back into a single record.
encoding: the character encoding, e.g. gbk or utf-8.
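To see why the encoding option matters, the following self-contained Scala sketch (plain JDK charsets, no Spark) encodes a Chinese CSV header as GBK bytes and shows that decoding them with the wrong charset garbles the text, which is exactly the mojibake a mismatched encoding option produces:

```scala
import java.nio.charset.Charset

val gbk  = Charset.forName("GBK")
val utf8 = Charset.forName("UTF-8")

// Bytes as they would appear in a CSV file saved with GBK encoding.
val bytes = "姓名,年龄".getBytes(gbk)

// Decoding with the matching charset recovers the original text.
assert(new String(bytes, gbk) == "姓名,年龄")

// Decoding the same GBK bytes as UTF-8 garbles the Chinese characters;
// this is the corruption that a correct encoding option prevents.
assert(new String(bytes, utf8) != "姓名,年龄")
```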
2. Writing out a CSV file
Core code:
```scala
resultDF.write.mode("append").csv("C:/Users/Desktop/123")

resultDF.write
  .mode("overwrite")
  .option("header", "true")
  .option("encoding", "utf-8")
  .csv("/user/hadoop/data")
```

Note that Spark writes the output as a directory containing part files, not as a single CSV file.
The mode parameter, as defined in the org.apache.spark.sql.DataFrameWriter source:
```scala
/**
 * Specifies the behavior when data or table already exists. Options include:
 * <ul>
 * <li>`overwrite`: overwrite the existing data.</li>
 * <li>`append`: append the data.</li>
 * <li>`ignore`: ignore the operation (i.e. no-op).</li>
 * <li>`error` or `errorifexists`: default option, throw an exception at runtime.</li>
 * </ul>
 *
 * @since 1.4.0
 */
def mode(saveMode: String): DataFrameWriter[T] = {
  this.mode = saveMode.toLowerCase(Locale.ROOT) match {
    case "overwrite" => SaveMode.Overwrite
    case "append" => SaveMode.Append
    case "ignore" => SaveMode.Ignore
    case "error" | "errorifexists" | "default" => SaveMode.ErrorIfExists
    case _ => throw new IllegalArgumentException(s"Unknown save mode: $saveMode. " +
      "Accepted save modes are overwrite, append, ignore, error, errorifexists.")
  }
  this
}
```
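The string-to-SaveMode mapping above can be exercised without Spark. The sketch below is a standalone re-implementation for illustration (the `SaveMode` type here is a hypothetical stand-in, not Spark's own enum), showing that matching is case-insensitive and that unknown strings throw:

```scala
import java.util.Locale

// Hypothetical stand-in for org.apache.spark.sql.SaveMode.
sealed trait SaveMode
object SaveMode {
  case object Overwrite     extends SaveMode
  case object Append        extends SaveMode
  case object Ignore        extends SaveMode
  case object ErrorIfExists extends SaveMode
}

// Mirrors the matching logic of DataFrameWriter.mode(String).
def parseSaveMode(s: String): SaveMode =
  s.toLowerCase(Locale.ROOT) match {
    case "overwrite"                           => SaveMode.Overwrite
    case "append"                              => SaveMode.Append
    case "ignore"                              => SaveMode.Ignore
    case "error" | "errorifexists" | "default" => SaveMode.ErrorIfExists
    case other =>
      throw new IllegalArgumentException(s"Unknown save mode: $other")
  }

// Case-insensitive: "Append" in the example above lowercases to "append".
assert(parseSaveMode("Append") == SaveMode.Append)
assert(parseSaveMode("overwrite") == SaveMode.Overwrite)
```

This explains why `mode("Append")` in the write example works even though the documented option is lowercase `append`.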