I am trying to use an API to retrieve all the buildings in LA County. The website for the dataset is here
The county has about 3 million buildings; I've filtered that down to roughly 1 million. You can see the filter in QUERY_PARAMS in the code below.
I've tried using Python, but unsurprisingly, retrieving 1 million records still takes a long time.
From the ESRI developer website, I understand that a single API call is limited to 10,000 results. Because of that limit, I need to paginate through all 1 million buildings.
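To make the scale concrete: with offset-based pagination and the 1,000-record page size used in the code below, covering ~1 million rows means on the order of 1,000 separate requests. The offsets themselves are just:

```python
# Page offsets for offset-based pagination: one request per page.
TOTAL = 1_000_000   # approximate number of filtered buildings
PAGE = 1_000        # resultRecordCount per request

offsets = list(range(0, TOTAL, PAGE))
print(len(offsets))  # 1000 requests just to page through everything once
```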
Here is my code so far; even with async requests it still takes about 10 minutes:
import aiohttp
import asyncio
import nest_asyncio

nest_asyncio.apply()  # Required if running in a Jupyter notebook

# Base URL for the API query (the actual URL was stripped from the post)
BASE_URL = "<FeatureServer layer /query endpoint>"

# Parameters for the query
QUERY_PARAMS = {
    "where": "(HEIGHT < 33) AND UseType = 'RESIDENTIAL' AND SitusCity IN ('LOS ANGELES CA', 'BEVERLY HILLS CA', 'PALMDALE')",
    "outFields": "*",
    "outSR": "4326",
    "f": "json",
    "resultRecordCount": 1000,  # Fetch 1000 records per request
}

async def fetch_total_count():
    """Fetch the total number of matching records."""
    params = QUERY_PARAMS.copy()
    params["returnCountOnly"] = "true"
    async with aiohttp.ClientSession() as session:
        async with session.get(BASE_URL, params=params) as response:
            data = await response.json()
            return data.get("count", 0)  # Extract total count

async def fetch(session, offset):
    """Fetch a batch of records using pagination."""
    params = QUERY_PARAMS.copy()
    params["resultOffset"] = offset
    async with session.get(BASE_URL, params=params) as response:
        return await response.json()

async def main():
    """Fetch all records asynchronously with pagination."""
    all_data = []
    total_count = await fetch_total_count()
    print(f"Total Records to Retrieve: {total_count}")

    semaphore = asyncio.Semaphore(10)  # Limit concurrency to prevent API overload

    async with aiohttp.ClientSession() as session:
        async def bound_fetch(offset):
            async with semaphore:
                return await fetch(session, offset)

        # Generate one task per page
        tasks = [bound_fetch(offset) for offset in range(0, total_count, 1000)]
        results = await asyncio.gather(*tasks)

    for data in results:
        if "features" in data:
            all_data.extend(data["features"])

    print(f"Total Records Retrieved: {len(all_data)}")
    return all_data

# Run the async function
all_data = asyncio.run(main())
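One variation I've seen suggested is paging by OBJECTID ranges in the where clause instead of resultOffset, since offsets make the server skip past all earlier rows on every request. Assuming the layer has a standard OBJECTID field (an assumption, I haven't verified it on this dataset), the range clauses could be built like this:

```python
def objectid_where_clauses(min_id, max_id, batch_size, base_where):
    """Build one where clause per OBJECTID range, so each request is an
    independent, index-friendly slice instead of an offset scan."""
    clauses = []
    for lo in range(min_id, max_id + 1, batch_size):
        hi = min(lo + batch_size - 1, max_id)
        clauses.append(f"({base_where}) AND OBJECTID BETWEEN {lo} AND {hi}")
    return clauses

# Example: 10 slices covering IDs 1..10000 in batches of 1000
clauses = objectid_where_clauses(1, 10_000, 1_000, "HEIGHT < 33")
print(len(clauses))  # 10
print(clauses[0])    # (HEIGHT < 33) AND OBJECTID BETWEEN 1 AND 1000
```

Each clause would replace the "where" value in QUERY_PARAMS for one request; the min/max IDs can come from a returnIdsOnly query first.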
I've turned to Databricks + Scala to speed up the data retrieval, but I'm brand new to big data computing. I'm vaguely aware that you need to "parallelize" your API calls and combine the results into one big DataFrame?
Can someone provide me suggestions?
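As I understand it, the pattern is: split the page offsets into chunks, fetch each chunk on a separate worker, then concatenate the results; on Databricks that would mean distributing the offsets across Spark partitions and unioning into one DataFrame. A minimal local sketch of the same shape, with the network call stubbed out (fake_fetch is a placeholder, not the real API call):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(offset):
    """Placeholder for the real paged API call; returns dummy 'features'."""
    return {"features": [{"id": offset + i} for i in range(1000)]}

def fetch_all(total_count, page_size=1000, workers=10):
    offsets = range(0, total_count, page_size)
    features = []
    # Each worker handles one page at a time; results come back in order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for page in pool.map(fake_fetch, offsets):
            features.extend(page.get("features", []))
    return features

rows = fetch_all(5000)
print(len(rows))  # 5000
```

Whether this is the right way to translate it into Spark tasks is exactly what I'm unsure about.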