
I have the following CSV, published by a third party, in which the values for a specific column contain a comma (for some unexplainable reason). The value for that column is either absent or enclosed in square brackets and double quotes, as it represents a range.

Following are some such records from the CSV:

A,B
xxxxxxxxx,"['05-01', '06-30']"
yyyyyyyyy,"['04-01', '04-30']"
zzzzzzzzz,

The culprit is obviously the second column. Is there a way to parse this CSV correctly in Apache Spark (Scala) so as to obtain the following dataframe:

+---------+--------------------+
|A        |B                   |
+---------+--------------------+
|xxxxxxxxx|"['05-01', '06-30']"|
|yyyyyyyyy|"['04-01', '04-30']"|
|zzzzzzzzz|null                |
+---------+--------------------+

  • This question is similar to: Parsing .csv file using Java 8 Stream. If you believe it's different, please edit the question to make it clear how it's different and/or how the answers to that question are not helpful for your problem. – talex
  • The suggested question doesn't talk about escaping delimiters. I didn't ask about a library to parse CSV but about dealing with an uncanny delimiter inside the column value. – ashish.g

1 Answer


The default values of the delimiter and quote options allow you to parse the given CSV correctly:

scala> scala.io.Source.fromFile("source.csv").mkString
res2: String =
"A,B
xxxxxxxxx,"['05-01', '06-30']"
yyyyyyyyy,"['04-01', '04-30']"
zzzzzzzzz,
"

scala> val df = spark.read.option("header", "true").csv("source.csv")
df: org.apache.spark.sql.DataFrame = [A: string, B: string]

scala> df.show()
+---------+------------------+
|        A|                 B|
+---------+------------------+
|xxxxxxxxx|['05-01', '06-30']|
|yyyyyyyyy|['04-01', '04-30']|
|zzzzzzzzz|              NULL|
+---------+------------------+

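For clarity, the same read with the relevant options spelled out (a minimal sketch; delimiter and quote are standard Spark CSV reader options, and the values shown are already Spark's defaults, so this is equivalent to the call above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-quotes").getOrCreate()

// "delimiter" (alias of "sep") and "quote" default to "," and "\"" respectively,
// so this reads exactly like spark.read.option("header", "true").csv(...)
val df = spark.read
  .option("header", "true")
  .option("delimiter", ",") // field separator (default)
  .option("quote", "\"")    // quoting character (default)
  .csv("source.csv")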

NOTE that the value of B does not have the double quotes around it. This is the correct interpretation of the given CSV content per the CSV format (RFC 4180):

  1. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

    "aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx
