I am pretty new to Databricks and PySpark. I am creating a DataFrame by reading a CSV file, but I am not calling any action. Yet I can see two jobs running. Can someone explain why?
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("uber_data_analysis").getOrCreate()
df = spark.read.csv("/FileStore/tables/uber_data.csv", header=True, inferSchema=True)
- Did you check the Spark UI to see what these jobs are? Also, depending on your version, the number of tasks may vary. – Steven Commented Feb 11 at 8:51
- I'm not sure for the 2 tasks but one of them is definitely scanning the file to infer the schema. – Steven Commented Feb 11 at 8:52
- Can you restart the cluster and try one more time? It's weird – mjeday Commented Feb 11 at 14:30
- I got this question re-opened. Revised answer slightly as back home. – Ged Commented Feb 18 at 15:09
2 Answers
The point is that the question is about the Databricks environment, which I also use. It could well be that this optimization does not happen for HDP on-prem or Cloudera, or that it is a configuration option of such environments. I got tired of setting up (Hive) metastores etc. for the plain-vanilla Spark stuff, so I cannot remember exactly, but we do see some material alluding to that.
With both parameters set to False, we get one job: path checking, partition discovery, and so on. An error is raised if the file cannot be found.
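For example, a minimal sketch (reusing the path from the question) of a read that only triggers that single up-front job, since neither the header names nor the column types have to be scanned for:
# Minimal sketch: with header=False and inferSchema=False no data scan is needed,
# so only the initial path-validation / file-listing job runs.
df_one_job = spark.read.csv(
    "/FileStore/tables/uber_data.csv",  # same path as in the question
    header=False,
    inferSchema=False,
)
df_one_job.printSchema()  # columns come back as _c0, _c1, ... all typed as string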
As soon as you request inferSchema=True, there will be an extra job for that exact process, run up-front of any action.
So:
- Reading the header to get column names and path checking etc.
- Reading the file for schema inference.
There will always be at least one job, which will verify some basic things like existence of source (file, folder, table, ...).
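One way to sidestep the schema-inference job entirely is to supply an explicit schema instead. This is a sketch only; the column names below are hypothetical, since the actual layout of uber_data.csv is not shown in the question:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical columns -- the real uber_data.csv layout is not shown in the question.
explicit_schema = StructType([
    StructField("pickup_datetime", TimestampType(), True),
    StructField("pickup_lat", DoubleType(), True),
    StructField("pickup_lon", DoubleType(), True),
    StructField("base", StringType(), True),
])

# With an explicit schema there is nothing to infer, so the extra scan job disappears.
df_explicit = spark.read.csv("/FileStore/tables/uber_data.csv", header=True, schema=explicit_schema)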
As soon as you read from a catalog, e.g. spark.read.table('hive_metastore.default.table1'), it won't need to "read the data" to figure out the schema, as the schema is part of the catalog, so there will be just the one job (no extra job for schema inference).
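For illustration, a sketch of such a catalog read (table name taken from the example above; it assumes the table actually exists in your metastore):
# The schema is served from the metastore/catalog, so no data-scanning job is
# needed just to determine column names and types.
df_table = spark.read.table("hive_metastore.default.table1")
df_table.printSchema()  # available without reading the underlying files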
If you're looking under the hood for a specific reason, then update the OP; otherwise, in general, this is not really something you should bother with unless it's affecting your job.