In Jupyter notebooks I succeed in reading parquet files from HDFS thanks to sparkmagic.
The sparkmagic config starts with:
{
  "kernel_python_credentials" : {
    "username": "admin",
    "password": "abcd",
    "url": "https://test.x.knox.y.fr:8443/gateway/cdp-proxy-api/livy",
    "auth": "Basic_Access"
  },
  ...
}
I want to work in VS Code with Python scripts instead of notebooks. How can I read and write parquet files in HDFS when I run scripts in VS Code?

I tried to set up the Spark config, but it didn't work. I have a main.py script that runs a package and starts with:
import os
import yaml
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.config("spark.sql.legacy.timeParserPolicy", "LEGACY").getOrCreate()
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
    spark.conf.set("spark.yarn.maxAttempts", 0)
    Class_1(...)

if __name__ == "__main__":
    main()
When I call read.parquet I get the error: [PATH_NOT_FOUND] Path does not exist:
file:/var/projects/test/package/projects/DEV/parquet_folder/
/var/projects/test/package/ is where my Python package is cloned.

spark.read.parquet('projects/DEV/parquet_folder/')

works in Jupyter notebooks.
1 Answer
The path file:/var/projects/test/package/projects/DEV/parquet_folder/ appears to be incorrect. When using the file:// prefix, you should include three slashes (file:///) to indicate an absolute path:
file:///var/projects/test/package/projects/DEV/parquet_folder/
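
As a minimal sketch (the file:/// path comes from the error message; the hdfs:/// location is an assumption, since in the notebook the relative path most likely resolved against your HDFS user home directory):

# Explicit URI schemes make the target filesystem unambiguous.
local_df = spark.read.parquet("file:///var/projects/test/package/projects/DEV/parquet_folder/")

# Hypothetical absolute HDFS location: a relative path like 'projects/DEV/parquet_folder/'
# resolves against the HDFS user home, e.g. /user/admin/.
hdfs_df = spark.read.parquet("hdfs:///user/admin/projects/DEV/parquet_folder/")
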
Otherwise just use the path without a prefix, /var/projects/test/package/projects/DEV/parquet_folder/, and let Spark look for the file/folder in the default filesystem (HDFS, S3, local filesystem) as configured in [core-site.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml) (property fs.defaultFS, which is the local filesystem by default).
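
That default is the key difference from the notebook: sparkmagic sends your code through Livy to run on the cluster, where fs.defaultFS already points at HDFS, while a standalone script defaults to the local filesystem. A minimal sketch of setting it yourself in the script (the namenode host/port is hypothetical; use your cluster's actual fs.defaultFS value, and the output path is made up for illustration):

from pyspark.sql import SparkSession

# 'spark.hadoop.'-prefixed options are passed through to the underlying Hadoop configuration.
spark = (
    SparkSession.builder
    .appName("hdfs-parquet")
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.fr:8020")  # hypothetical namenode
    .getOrCreate()
)

# Relative paths now resolve against HDFS (the user's HDFS home directory),
# matching the behaviour seen in the sparkmagic notebook.
df = spark.read.parquet("projects/DEV/parquet_folder/")
df.write.mode("overwrite").parquet("projects/DEV/parquet_output/")  # hypothetical output path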