How to read/write parquet on remote HDFS with python/pyspark in VSCode

In Jupyter notebooks I succeed in reading Parquet files from HDFS thanks to sparkmagic.

My sparkmagic config starts with:

{
  "kernel_python_credentials" : {
    "username": "admin",
    "password": "abcd",
    "url": "https://test.x.knox.y.fr:8443/gateway/cdp-proxy-api/livy",
    "auth": "Basic_Access"
  }
...
}
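
For context, sparkmagic just drives Livy over HTTP. Below is a rough sketch of what it does under the hood, reusing the endpoint and credentials from the config above (illustrative, not a drop-in replacement for sparkmagic):

import time
import requests

# Livy endpoint and Basic auth, taken from the sparkmagic config above
LIVY = "https://test.x.knox.y.fr:8443/gateway/cdp-proxy-api/livy"
AUTH = ("admin", "abcd")

# Open a PySpark session on the cluster
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}, auth=AUTH).json()

# Poll until the session is ready to accept statements
while requests.get(f"{LIVY}/sessions/{session['id']}", auth=AUTH).json()["state"] != "idle":
    time.sleep(5)

# The submitted code runs on the cluster, where relative paths resolve to HDFS
code = "spark.read.parquet('projects/DEV/parquet_folder/').count()"
requests.post(f"{LIVY}/sessions/{session['id']}/statements", json={"code": code}, auth=AUTH)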

I want to work in VS Code with Python scripts, not notebooks anymore.

How can I read and write Parquet files in HDFS when I run scripts from VS Code?

I tried to set up the Spark config, but it didn't work. I have a main.py script that runs a package and starts with:

import os
import yaml
from pyspark.sql import SparkSession


def main():
    # Build or reuse the SparkSession with legacy datetime handling
    spark = SparkSession.builder.config("spark.sql.legacy.timeParserPolicy", "LEGACY").getOrCreate()
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
    spark.conf.set("spark.yarn.maxAttempts", 0)

    Class_1(...)


if __name__ == "__main__":
    main()

When I call spark.read.parquet I get the error [PATH_NOT_FOUND] Path does not exist:

file:/var/projects/test/package/projects/DEV/parquet_folder/

/var/projects/test/package/ is where my Python package is cloned.

spark.read.parquet('projects/DEV/parquet_folder/') works in Jupyter notebooks.
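
To see which filesystem a given session resolves relative paths against, printing fs.defaultFS helps (a small diagnostic sketch; it goes through PySpark's internal JVM bridge, so treat it as illustrative):

# Diagnostic: the scheme of fs.defaultFS tells you where relative paths go
# (hdfs://... on the cluster via Livy, file:/// in a plain local run)
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.defaultFS"))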


1 Answer


The path file:/var/projects/test/package/projects/DEV/parquet_folder/ appears to be incorrect. When using the file:// prefix, you should include three slashes (file:///) to indicate an absolute path.

file:///var/projects/test/package/projects/DEV/parquet_folder/
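
For example (the HDFS path below is an assumption about where the data actually lives; under Livy the relative path most likely resolved against your HDFS home directory, e.g. /user/admin/):

# Explicit URI schemes remove any ambiguity about the target filesystem
df_local = spark.read.parquet("file:///var/projects/test/package/projects/DEV/parquet_folder/")

# Hypothetical HDFS location - adjust to wherever the data really sits
df_hdfs = spark.read.parquet("hdfs:///user/admin/projects/DEV/parquet_folder/")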

Otherwise, just use the path without a prefix (/var/projects/test/package/projects/DEV/parquet_folder/)

and let Spark look for the file or folder in the default filesystem (HDFS, S3, or the local filesystem) as configured in [core-site.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml) (property fs.defaultFS, which is the local filesystem by default).
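
If you want relative paths in your VS Code runs to resolve against HDFS the way they do under Livy, you can point the session at the cluster's default filesystem. A minimal sketch, assuming your machine can reach the cluster directly; hdfs://namenode.example.com:8020 is a placeholder for your actual NameNode URI:

from pyspark.sql import SparkSession

# spark.hadoop.* properties are forwarded to the Hadoop configuration,
# so this overrides fs.defaultFS for the session.
# hdfs://namenode.example.com:8020 is a placeholder - use your NameNode URI.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")
    .getOrCreate()
)

# Relative paths now resolve against HDFS (typically under /user/<username>/)
df = spark.read.parquet("projects/DEV/parquet_folder/")

If the machine running VS Code cannot reach HDFS directly, going through Knox/Livy (as sparkmagic does) remains the practical route.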
