I am pretty new to Databricks and PySpark. I am creating a DataFrame by reading a CSV file, but I am not calling any action. Yet I can see two jobs running. Can someone explain why?
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("uber_data_analysis").getOrCreate()
df = spark.read.csv("/FileStore/tables/uber_data.csv", header=True, inferSchema=True)
- Did you check the Spark UI to see what these jobs are? Also, depending on your version, the number of tasks may vary. – Steven Commented Feb 11 at 8:51
- I'm not sure for the 2 tasks but one of them is definitely scanning the file to infer the schema. – Steven Commented Feb 11 at 8:52
- Can you restart the cluster and try one more time? It's weird – mjeday Commented Feb 11 at 14:30
- I got this question re-opened. Revised answer slightly as back home. – Ged Commented Feb 18 at 15:09
2 Answers
The point is that the question is about the Databricks environment, which I also use. It could well be that this optimization does not happen for HDP on-prem or Cloudera, or that it is a configuration option of such environments. I got tired of setting up (Hive) metastores etc. for the plain-vanilla Spark stuff, so I cannot remember exactly, but we do see some material alluding to that.
With both parameters set to False, we get one job: path checking, partition discovery, and so on. An error is raised if the file cannot be found.
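For example, a minimal sketch (reusing the path from the question) of a read that only triggers that single up-front job, since neither the header names nor the column types have to be scanned for:
# Minimal sketch: with header=False and inferSchema=False no data scan is needed,
# so only the initial path-validation / file-listing job runs.
df_one_job = spark.read.csv(
    "/FileStore/tables/uber_data.csv",  # same path as in the question
    header=False,
    inferSchema=False,
)
df_one_job.printSchema()  # columns come back as _c0, _c1, ... all typed as string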
As soon as you request inferSchema=True, there will be an extra job for that exact process, run up-front of any action.
So:
- Reading the header to get column names and path checking etc.
- Reading the file for schema inference.
There will always be at least one job, which will verify some basic things like existence of source (file, folder, table, ...).
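One way to sidestep the schema-inference job entirely is to supply an explicit schema instead. This is a sketch only; the column names below are hypothetical, since the actual layout of uber_data.csv is not shown in the question:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical columns -- the real uber_data.csv layout is not shown in the question.
explicit_schema = StructType([
    StructField("pickup_datetime", TimestampType(), True),
    StructField("pickup_lat", DoubleType(), True),
    StructField("pickup_lon", DoubleType(), True),
    StructField("base", StringType(), True),
])

# With an explicit schema there is nothing to infer, so the extra scan job disappears.
df_explicit = spark.read.csv("/FileStore/tables/uber_data.csv", header=True, schema=explicit_schema)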
As soon as you read from a catalog, e.g. spark.read.table('hive_metastore.default.table1'), it won't need to "read the data" to figure out the schema, as the schema is part of the catalog, so there will be just the one job (no extra job for schema inference).
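For illustration, a sketch of such a catalog read (table name taken from the example above; it assumes the table actually exists in your metastore):
# The schema is served from the metastore/catalog, so no data-scanning job is
# needed just to determine column names and types.
df_table = spark.read.table("hive_metastore.default.table1")
df_table.printSchema()  # available without reading the underlying files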
If you're looking under the hood for a specific reason, then update the OP; otherwise, in general, this is not really something you should bother with unless it's affecting your job.