I can read files from Azure Storage into Pandas like this

import pandas as pd
from azure.identity import AzureCliCredential
    
credential = AzureCliCredential()
pd.read_csv(
    "abfs://my_container/my_file.csv",
    storage_options={'account_name': 'my_account', 'credential': credential}
)

Getting the token from AzureCliCredential is slow. Is there a way to make pandas/fsspec cache the token so that the slow token retrieval process is not repeated over and over again when I open many files?


  • If you read multiple files in the same session, the original filesystem instance with its credentials will be reused. Are you suggesting persisting the token between sessions? How long does it live? – mdurant Commented Jan 31 at 15:34

1 Answer


Getting the token from AzureCliCredential is slow. Is there a way to make pandas/fsspec cache the token so that the slow token retrieval process is not repeated over and over again when I open many files?

I agree with mdurant's comment that if you read multiple files within the same session, fsspec should reuse the original filesystem instance and its credentials.
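
As a quick illustration, here is a minimal sketch (using a hypothetical account name) of fsspec's instance caching: filesystem instances are keyed on their constructor arguments, so repeated reads with the same storage_options reuse one instance and its credential.

import fsspec
from azure.identity import AzureCliCredential

credential = AzureCliCredential()
fs1 = fsspec.filesystem("abfs", account_name="my_account", credential=credential)
fs2 = fsspec.filesystem("abfs", account_name="my_account", credential=credential)

# Both calls return the same cached instance, so the token is not re-fetched.
assert fs1 is fs2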

When using AzureCliCredential, the token lifetime depends on your Azure AD configuration; it typically lasts about one hour before expiring.
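
If you want to check the lifetime you are actually getting, here is a small sketch (the scope shown is the standard one for Azure Storage):

from datetime import datetime, timezone
from azure.identity import AzureCliCredential

# Request a token for the Azure Storage scope and inspect its expiry timestamp.
token = AzureCliCredential().get_token("https://storage.azure.com/.default")
print(datetime.fromtimestamp(token.expires_on, tz=timezone.utc))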

The code below shows how to persist and reuse the token across sessions by caching it to disk.

Code:

import json
import os
from datetime import datetime, timezone
import pandas as pd

from azure.core.credentials import AccessToken, TokenCredential
from azure.identity import AzureCliCredential

TOKEN_CACHE_FILE = "azure_token_cache.json"

class CachedCredential(TokenCredential):
    def __init__(self, underlying_credential):
        self.underlying_credential = underlying_credential
        self._token = None
        self._expires_on = None
        self.load_cached_token()

    def load_cached_token(self):
        # Reuse a previously saved token if the cache file exists and is still valid.
        if os.path.exists(TOKEN_CACHE_FILE):
            try:
                with open(TOKEN_CACHE_FILE, "r") as f:
                    cache = json.load(f)
                expiry_datetime = datetime.fromtimestamp(cache["expires_on"], timezone.utc)
                if expiry_datetime > datetime.now(timezone.utc):
                    self._token = cache["token"]
                    self._expires_on = cache["expires_on"]
                    print("Loaded cached token, expires at:", expiry_datetime)
            except Exception as e:
                print("Failed to load cached token:", e)

    def save_token(self):
        # Persist the token to disk so later sessions can reuse it.
        # Note: this stores the raw access token in plaintext; protect the file accordingly.
        cache = {"token": self._token, "expires_on": self._expires_on}
        with open(TOKEN_CACHE_FILE, "w") as f:
            json.dump(cache, f)

    def get_token(self, *scopes, **kwargs):
        now_ts = datetime.now(timezone.utc).timestamp()
        # Refresh when no token is cached or it is within 5 minutes of expiry,
        # so we never return a token that is about to lapse mid-request.
        if self._token is None or now_ts >= self._expires_on - 300:
            token_obj = self.underlying_credential.get_token(*scopes, **kwargs)
            self._token = token_obj.token
            self._expires_on = token_obj.expires_on
            self.save_token()
            expiry_datetime = datetime.fromtimestamp(self._expires_on, timezone.utc)
            print("Fetched new token, expires at:", expiry_datetime)
        return AccessToken(self._token, self._expires_on)

def main():
    underlying_credential = AzureCliCredential()
    cached_credential = CachedCredential(underlying_credential)
    
    token_obj = cached_credential.get_token("https://storage.azure.com/.default")
    token_str = token_obj.token
    expiry_datetime = datetime.fromtimestamp(token_obj.expires_on, tz=timezone.utc)
    
    print("\nAccess Token:")
    print(token_str)
    print("\nExpires On:")
    print(expiry_datetime)
    storage_options = {
        "account_name": "xxxxx",  # Replace with your actual storage account name.
        "credential": cached_credential  # Pass the credential object.
    }
    try:
        df = pd.read_csv("abfs://sample/001.csv", storage_options=storage_options)
        print("\nDataFrame Head:")
        print(df.head())
    except Exception as e:
        print("\nError reading file:", e)

if __name__ == "__main__":
    main()

Output:

Fetched new token, expires at: 2025-02-03 09:48:59+00:00

Access Token:
xxxxx

Expires On:
2025-02-03 xxxxx9+00:00

DataFrame Head:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

However, I would suggest using a SAS token as an alternative to a credential like AzureCliCredential.

Code:

import pandas as pd

storage_options = {
    "account_name": "your_account_name",
    "sas_token": "your_sas_token"
}

df = pd.read_csv("abfs://your_container/your_file.csv", storage_options=storage_options)
print(df.head())

You can generate a SAS token with a long expiration time to read the CSV files.
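
For example, here is a minimal sketch that generates a container-level SAS token with the azure-storage-blob package (the account name, container name, and account key are placeholders):

from datetime import datetime, timedelta, timezone
from azure.storage.blob import ContainerSasPermissions, generate_container_sas

# Create a read/list SAS token valid for 30 days; adjust permissions and expiry as needed.
sas_token = generate_container_sas(
    account_name="your_account_name",
    container_name="your_container",
    account_key="your_account_key",
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=30),
)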

Reference:

  • Use Pandas to read/write ADLS data in serverless Apache Spark pool in Synapse Analytics - Azure Synapse Analytics | Microsoft Learn
  • timeout - Azure Synapse time out - Token expire - Stack Overflow by JayashankarGS
