I have the following CSV, published by a third party, in which the values for a specific column contain a comma (for some unexplainable reason). The value for that column is either absent or enclosed in square brackets and double quotes, as it represents a range.
Following is a sample of such records from the CSV:
A,B
xxxxxxxxx,"['05-01', '06-30']"
yyyyyyyyy,"['04-01', '04-30']"
zzzzzzzzz,
The culprit, obviously, is the second column. Is there a way to correctly parse this CSV in Apache Spark (Scala) so as to get the following dataframe:
+---------+--------------------+
|A        |B                   |
+---------+--------------------+
|xxxxxxxxx|"['05-01', '06-30']"|
|yyyyyyyyy|"['04-01', '04-30']"|
|zzzzzzzzz|null                |
+---------+--------------------+
- This question is similar to: Parsing .csv file using Java 8 Stream. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. – talex Commented Jan 21 at 15:23
- The suggested question doesn't talk about escaping delimiters. I didn't ask about a library to parse CSV but about dealing with an uncanny delimiter inside the column value. – ashish.g Commented Jan 21 at 15:29
1 Answer
The default values of the delimiter and quote options already allow you to parse the given CSV correctly:
scala> scala.io.Source.fromFile("source.csv").mkString
res2: String =
"A,B
xxxxxxxxx,"['05-01', '06-30']"
yyyyyyyyy,"['04-01', '04-30']"
zzzzzzzzz,
"
scala> val df = spark.read.option("header", "true").csv("source.csv")
df: org.apache.spark.sql.DataFrame = [A: string, B: string]
scala> df.show()
+---------+------------------+
| A| B|
+---------+------------------+
|xxxxxxxxx|['05-01', '06-30']|
|yyyyyyyyy|['04-01', '04-30']|
|zzzzzzzzz| NULL|
+---------+------------------+
scala>
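For reference, here is the same read with those defaults spelled out explicitly. This is a minimal sketch, not a required change: "sep" and "quote" as shown are Spark's documented defaults, and `spark` is the session that spark-shell already provides.
// Runs in spark-shell, where `spark` is predefined.
// Spelling out the defaults makes it explicit why the quoted field
// "['05-01', '06-30']" survives the embedded comma.
val df = spark.read
  .option("header", "true")
  .option("sep", ",")     // default field delimiter
  .option("quote", "\"")  // default quote character
  .csv("source.csv")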
NOTE that the values for B do not have double quotes around them, which is the correct interpretation of the given CSV content per the CSV format (RFC 4180):
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx