I have a few hundred AWS Batch jobs that run on EC2 on-demand instances. Each job carries out some computation to generate two parquet files, A and B, and uploads them to separate paths in the same bucket.
When I run these Batch jobs, roughly 60-70% of them fail. Inspecting the logs, some of these jobs upload file A successfully while file B fails with the following exception:
botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL:
Sometimes both files A and B fail to upload with the same exception.
The parquet files are each about 200 MB.
The other 30-40% of jobs, which do succeed, do not hit this network issue.
What could be the cause of this intermittent failure? How would one go about debugging this?
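For context, the upload step is roughly the following (a simplified sketch assuming a boto3 upload_file call; the bucket and key names are placeholders, not the real paths):

import boto3

# Placeholder bucket/prefix names for illustration only.
s3 = boto3.client("s3")

def upload_outputs(path_a: str, path_b: str) -> None:
    # Each job produces two ~200 MB parquet files and uploads them
    # to separate prefixes in the same bucket.
    s3.upload_file(path_a, "my-output-bucket", "prefix-a/result_a.parquet")
    s3.upload_file(path_b, "my-output-bucket", "prefix-b/result_b.parquet")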
EDIT - I'll mark this closed. For anyone else running into this issue: the cause was a self-hosted NAT that was throttling bandwidth. I had set up too small an instance (fck-nat), which couldn't handle the 100-odd jobs running at the same time.
Comment from jarmod: Are you uploading from a custom boto3 app or awscli? What endpoint are you using? Does this traffic route via IGW or NAT or VPC Endpoint to S3? Which EC2 instance type? How many concurrent uploads to the same S3 bucket are you making?
1 Answer
Will need some code snippets to dig further... There are some similar answers covering:
- Networking issues / VPN
- boto3 / AWS CLI settings such as connect and read timeouts (e.g. --cli-connect-timeout for the CLI, or connect_timeout / read_timeout in botocore's Config), which might help; see the sketch after this list
- Multiprocessing issues in Python
- A number of other issues such as packet drops or API calls taking too long
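For instance, a minimal sketch of tightening the timeouts and retries on the boto3 client (the specific values here are illustrative starting points, not recommendations):

import boto3
from botocore.config import Config

# Illustrative values; tune for your workload.
config = Config(
    connect_timeout=10,    # seconds to establish the connection
    read_timeout=120,      # seconds to wait for a response
    retries={"max_attempts": 10, "mode": "adaptive"},
)

s3 = boto3.client("s3", config=config)
s3.upload_file("result_b.parquet", "my-output-bucket", "prefix-b/result_b.parquet")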
Questions to ask:
- Are they all in the same VPC?
- Is there any difference between the scripts?
- Is there any data skew (i.e. some files are much larger than others)?
- If you're doing some processing on input files that are all 200 MB but the transformations create new data, those transforms might create skew in the final output (though I'm not sure).
- Are you sure they're all on-demand and not being dropped as spot instances?
- Lastly, are you using long-lived connections throughout? I.e. do you have something like the following:
s3 = boto3.client("s3")                   # client created up front
process_data_for_a_while()                # long-running computation
s3.upload_file(local_path, bucket, key)   # upload much later
If you do, then maybe the client/connection is too long-lived; see the sketch below for one way to refresh it.
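One way to rule that out is to create the S3 client just before the upload rather than at the start of the job. A sketch (the helper and path names below are hypothetical):

import boto3

def upload_when_ready(local_path: str, bucket: str, key: str) -> None:
    # Create the client right before uploading, so the underlying
    # connection pool is fresh rather than hours old.
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)

# Hypothetical flow: compute first, then upload with a fresh client.
# result_a, result_b = run_computation()
# upload_when_ready(result_a, "my-output-bucket", "prefix-a/result_a.parquet")
# upload_when_ready(result_b, "my-output-bucket", "prefix-b/result_b.parquet")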