I can read files from Azure Storage into Pandas like this:

import pandas as pd
from azure.identity import AzureCliCredential

credential = AzureCliCredential()
pd.read_csv(
    "abfs://my_container/my_file.csv",
    storage_options={"account_name": "my_account", "credential": credential},
)

Getting the token from AzureCliCredential is slow. Is there a way to make pandas/fsspec cache the token so that the slow token-retrieval process is not repeated over and over again when I open many files?
Comment from mdurant: If you read multiple files in the same session, the original filesystem instance with its credentials will be reused. Are you suggesting persisting the token between sessions? How long does it live?
1 Answer
Getting the token from AzureCliCredential is slow. Is there a way to make pandas/fsspec cache the token so that the slow token retrieval process is not repeated over and over again when I open many files?
I agree with mdurant's comment: if you read multiple files within the same session, fsspec should reuse the original filesystem instance and its credentials.
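As a minimal sketch of that instance caching (assuming adlfs is installed; the account name is a placeholder), you can verify that fsspec hands back the same filesystem object for identical storage options:

import fsspec
from azure.identity import AzureCliCredential

credential = AzureCliCredential()

# fsspec caches filesystem instances keyed by their constructor arguments,
# so identical calls return the same object (and the same credential).
fs1 = fsspec.filesystem("abfs", account_name="my_account", credential=credential)
fs2 = fsspec.filesystem("abfs", account_name="my_account", credential=credential)
print(fs1 is fs2)  # True: the second call reuses the cached instance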
When using AzureCliCredential, the token lifetime depends on the Azure AD configuration; it typically lasts for about 1 hour before expiring.
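To check the lifetime you actually get (a small sketch, assuming the Azure CLI is logged in), you can request a token directly and inspect its expires_on timestamp:

from datetime import datetime, timezone
from azure.identity import AzureCliCredential

credential = AzureCliCredential()
token = credential.get_token("https://storage.azure.com/.default")

# expires_on is a Unix timestamp; convert it to a readable UTC datetime.
print(datetime.fromtimestamp(token.expires_on, tz=timezone.utc))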
The code below shows how to persist and reuse tokens across sessions by caching them to disk.
Code:
import json
import os
from datetime import datetime, timezone

import pandas as pd
from azure.core.credentials import AccessToken, TokenCredential
from azure.identity import AzureCliCredential

TOKEN_CACHE_FILE = "azure_token_cache.json"


class CachedCredential(TokenCredential):
    """Wraps a credential and caches its token on disk between sessions."""

    def __init__(self, underlying_credential):
        self.underlying_credential = underlying_credential
        self._token = None
        self._expires_on = None
        self.load_cached_token()

    def load_cached_token(self):
        # Reuse a previously saved token if it has not expired yet.
        if os.path.exists(TOKEN_CACHE_FILE):
            try:
                with open(TOKEN_CACHE_FILE, "r") as f:
                    cache = json.load(f)
                expiry_datetime = datetime.fromtimestamp(cache["expires_on"], timezone.utc)
                if expiry_datetime > datetime.now(timezone.utc):
                    self._token = cache["token"]
                    self._expires_on = cache["expires_on"]
                    print("Loaded cached token, expires at:", expiry_datetime)
            except Exception as e:
                print("Failed to load cached token:", e)

    def save_token(self):
        # Note: this stores the bearer token in plain text on disk.
        cache = {"token": self._token, "expires_on": self._expires_on}
        with open(TOKEN_CACHE_FILE, "w") as f:
            json.dump(cache, f)

    def get_token(self, *scopes, **kwargs):
        now_ts = datetime.now(timezone.utc).timestamp()
        if self._token is None or now_ts >= self._expires_on:
            # Cache miss or expired token: fetch a fresh one (the slow path).
            token_obj = self.underlying_credential.get_token(*scopes, **kwargs)
            self._token = token_obj.token
            self._expires_on = token_obj.expires_on
            self.save_token()
            expiry_datetime = datetime.fromtimestamp(self._expires_on, timezone.utc)
            print("Fetched new token, expires at:", expiry_datetime)
        return AccessToken(self._token, self._expires_on)


def main():
    underlying_credential = AzureCliCredential()
    cached_credential = CachedCredential(underlying_credential)

    token_obj = cached_credential.get_token("https://storage.azure.com/.default")
    token_str = token_obj.token
    expiry_datetime = datetime.fromtimestamp(token_obj.expires_on, tz=timezone.utc)
    print("\nAccess Token:")
    print(token_str)
    print("\nExpires On:")
    print(expiry_datetime)

    storage_options = {
        "account_name": "xxxxx",  # Replace with your actual storage account name.
        "credential": cached_credential,  # Pass the credential object.
    }
    try:
        df = pd.read_csv("abfs://sample/001.csv", storage_options=storage_options)
        print("\nDataFrame Head:")
        print(df.head())
    except Exception as e:
        print("\nError reading file:", e)


if __name__ == "__main__":
    main()
Output:
Fetched new token, expires at: 2025-02-03 09:48:59+00:00
Access Token:
xxxxx
Expires On:
2025-02-03 xxxxx9+00:00
DataFrame Head:
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S
However, I would suggest using a SAS token as an alternative to a credential like AzureCliCredential.
Code:
import pandas as pd

storage_options = {
    "account_name": "your_account_name",
    "sas_token": "your_sas_token",
}

df = pd.read_csv("abfs://your_container/your_file.csv", storage_options=storage_options)
print(df.head())
You can generate a SAS token with a long expiration time to read the CSV files.
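As a sketch of one way to do that (assuming you have the account key and the azure-storage-blob package installed; the account, container, and key values are placeholders), you can generate a container-level SAS with generate_container_sas:

from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

# Placeholder values: substitute your real account name, container, and key.
sas_token = generate_container_sas(
    account_name="your_account_name",
    container_name="your_container",
    account_key="your_account_key",
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(days=30),  # long-lived token
)

storage_options = {"account_name": "your_account_name", "sas_token": sas_token}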
Reference:
- Use Pandas to read/write ADLS data in serverless Apache Spark pool in Synapse Analytics - Azure Synapse Analytics | Microsoft Learn
- Azure Synapse time out - Token expire - Stack Overflow (answer by JayashankarGS)