Mastering Apache Spark 2.x (Second Edition)
Defining schemas manually
First, we have to import the classes used to describe a schema:
import org.apache.spark.sql.types._
Let's define a schema for a CSV file. To create such a file, we can simply write the DataFrame from the previous section to HDFS (again using the Apache Spark DataSource API):
washing_flat.write.csv("hdfs://localhost:9000/tmp/washing_flat.csv")
Let's double-check the contents of the directory in HDFS:
[Screenshot: listing of the washing_flat.csv directory in HDFS]
Finally, double-check the content of one file:
[Screenshot: contents of one of the CSV part files]
The data itself is intact: we've lost the schema information, but the rest of the information is preserved. We can see this if we use the DataSource API to load the CSV again:
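The effect can be reproduced with a small round trip. The following is a minimal sketch meant for spark-shell or a notebook, not the original listing: the three-column `sample` DataFrame and the local temporary path are hypothetical stand-ins for `washing_flat` and the HDFS location used above.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder().master("local[*]").appName("csv-schema-loss").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for washing_flat (made-up values, only three columns).
val sample = Seq(("id-1", 1L, 11L), ("id-2", 2L, 12L)).toDF("_id", "count", "flowrate")

// Write to a fresh local directory; on a cluster this would be the HDFS path used above.
val path = Files.createTempDirectory("washing").resolve("washing_flat.csv").toUri.toString
sample.write.csv(path)

// Read it back without a header or an explicit schema.
val reloaded = spark.read.csv(path)

// Columns come back with generated names (_c0, _c1, ...) and are all typed as string.
reloaded.printSchema()
```

Because the CSV was written without a header row and read without a schema, Spark has nothing to recover the names or types from.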
All columns are now identified as strings and the original column names are gone, which confirms that the schema information was lost. Now let's create the schema manually:
val schema = StructType(
  StructField("_id", StringType, true) ::
  StructField("_rev", StringType, true) ::
  StructField("count", LongType, true) ::
  StructField("flowrate", LongType, true) ::
  StructField("fluidlevel", StringType, true) ::
  StructField("frequency", LongType, true) ::
  StructField("hardness", LongType, true) ::
  StructField("speed", LongType, true) ::
  StructField("temperature", LongType, true) ::
  StructField("ts", LongType, true) ::
  StructField("voltage", LongType, true) ::
  Nil)
If we now load the CSV file as an RDD called rawRDD, we get a collection of strings, one string per row:
[Screenshot: rawRDD loaded as one string per row]
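As the original listing is not reproduced here, the following is a sketch of one plausible way to obtain such an RDD, using `SparkContext.textFile`. The name `rawRDD` comes from the text; the two sample lines and the local temporary file are made up so that the snippet runs without an HDFS installation.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("raw-rdd").getOrCreate()

// Stand-in for the CSV on HDFS: two made-up lines in the column order of the schema.
// With the file written earlier, pass "hdfs://localhost:9000/tmp/washing_flat.csv" instead.
val file = Files.createTempFile("washing_flat", ".csv")
Files.write(file, "id-1,rev-1,1,11,acceptable,,,,,1000,\nid-2,rev-2,2,,,,70,,80,2000,230\n".getBytes("UTF-8"))

// textFile returns an RDD[String] with one element per line of input.
val rawRDD = spark.sparkContext.textFile(file.toUri.toString)

rawRDD.collect().foreach(println)
```

When `textFile` is pointed at a directory, as with the output of `write.csv`, it reads all part files inside it.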
Now we have to transform rawRDD into a more usable RDD of Row objects by splitting each row string and creating the respective Row. In addition, we convert the fields to the appropriate data types where necessary:
[Screenshot: code mapping rawRDD to an RDD of Row objects]
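A minimal sketch of this transformation could look as follows. It is an assumption about the approach, not the book's exact code: the helpers `toLongOrNull` and `toStringOrNull` are hypothetical, the input lines are made up, and empty fields are treated as nulls because a plain `toLong` on an empty string would throw.

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("row-rdd").getOrCreate()

// Made-up lines in the column order of the schema; empty fields stand for nulls.
val rawRDD = spark.sparkContext.parallelize(Seq(
  "id-1,rev-1,1,11,acceptable,,,,,1000,",
  "id-2,rev-2,2,,,,70,,80,2000,230"
))

// Hypothetical helpers: map an empty CSV field to null, otherwise convert it.
def toLongOrNull(s: String): java.lang.Long =
  if (s.isEmpty) null else java.lang.Long.valueOf(s)
def toStringOrNull(s: String): String =
  if (s.isEmpty) null else s

val rowRDD = rawRDD
  .map(_.split(",", -1)) // limit -1 keeps trailing empty fields
  .map { p =>
    Row(
      toStringOrNull(p(0)), toStringOrNull(p(1)), // _id, _rev
      toLongOrNull(p(2)), toLongOrNull(p(3)),     // count, flowrate
      toStringOrNull(p(4)),                       // fluidlevel
      toLongOrNull(p(5)), toLongOrNull(p(6)),     // frequency, hardness
      toLongOrNull(p(7)), toLongOrNull(p(8)),     // speed, temperature
      toLongOrNull(p(9)), toLongOrNull(p(10))     // ts, voltage
    )
  }

rowRDD.collect().foreach(println)
```

Splitting on a bare comma only works as long as no field contains a quoted comma; data with such fields needs a proper CSV parser.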
Finally, we recreate our DataFrame object using the following code:
[Screenshot: code creating the DataFrame from the Row RDD and the schema]
If we now print the schema, we see that it matches the original one again:
[Screenshot: printed schema of the recreated DataFrame]
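The last two steps can be sketched as follows, assuming the DataFrame is rebuilt with `SparkSession.createDataFrame`, which accepts an RDD of Row objects together with a StructType. The single made-up row and the name `washingFlatDf` are illustrative only.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("recreate-df").getOrCreate()

// Same schema as defined manually above.
val schema = StructType(
  StructField("_id", StringType, true) ::
  StructField("_rev", StringType, true) ::
  StructField("count", LongType, true) ::
  StructField("flowrate", LongType, true) ::
  StructField("fluidlevel", StringType, true) ::
  StructField("frequency", LongType, true) ::
  StructField("hardness", LongType, true) ::
  StructField("speed", LongType, true) ::
  StructField("temperature", LongType, true) ::
  StructField("ts", LongType, true) ::
  StructField("voltage", LongType, true) ::
  Nil)

// One made-up row; in the walkthrough this RDD comes from parsing rawRDD.
val rowRDD = spark.sparkContext.parallelize(Seq(
  Row("id-1", "rev-1", 1L, 11L, "acceptable", null, null, null, null, 1000L, null)
))

// Attach the schema to the rows to get a typed DataFrame.
val washingFlatDf = spark.createDataFrame(rowRDD, schema)

// Column names and types now follow the manually defined schema.
washingFlatDf.printSchema()
```

Spark checks the row values against the schema only when the data is evaluated, so a mismatch between a Row and its StructType surfaces at the first action rather than at `createDataFrame`.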