We have a service with this architecture:

  • HTTPS requests come into an A10 load balancer that does L4 load balancing
  • Behind it are 2 backend servers running Apache, which terminates the TLS connection
  • In Apache there is a ProxyPass rule that talks to an HTTP service on localhost
  • This local service uses gunicorn and is implemented in Python using aiohttp (Python 3.11.2 on Debian Linux ("bookworm"), running as a VMware VM)

Now we have some cases where users reported timeouts. Both the Python application and Apache write logs, and we've traced some of the timeouts to a weird delay. The Python application logs a message right before it passes its response to gunicorn (and thus to Apache), and sometimes the time between this log entry and the Apache access log entry is really long, like 300s.

This is a somewhat rare occurrence: on a day where we process about 880k requests, there are just 17 requests where the delay is 30s or more, which makes this kinda hard to debug.
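
The log comparison described above can be sketched roughly as follows. This is only an illustration: the log line formats, the `request_id` field, and the function names are assumptions, not the actual logs of this service. It also assumes the Apache timestamp marks the end of request processing rather than the start.

```python
from datetime import datetime, timedelta

# Hypothetical line formats (assumptions, not the real logs):
#   app log:    "<ISO timestamp> request_id=<id> response ready"
#   Apache log: "<ISO timestamp> request_id=<id> status=<code> bytes=<n>"


def parse_line(line):
    """Return (request_id, timestamp) for one log line in the assumed format."""
    stamp, rest = line.split(" ", 1)
    fields = dict(part.split("=", 1) for part in rest.split() if "=" in part)
    return fields["request_id"], datetime.fromisoformat(stamp)


def find_slow_requests(app_lines, apache_lines, threshold=timedelta(seconds=30)):
    """List (request_id, delay) pairs where Apache logged at least
    `threshold` after the application did, slowest first."""
    app_times = dict(parse_line(line) for line in app_lines)
    slow = []
    for line in apache_lines:
        request_id, apache_time = parse_line(line)
        app_time = app_times.get(request_id)
        if app_time is None:
            continue  # no matching application log entry
        delay = apache_time - app_time
        if delay >= threshold:
            slow.append((request_id, delay))
    return sorted(slow, key=lambda item: item[1], reverse=True)
```

Feeding both log files through `find_slow_requests` yields the handful of request IDs worth looking at in detail, along with how long each one was delayed.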

Capturing all the network traffic in a huge .pcap file and then sifting through it is kinda hopeless: it's far too much data, and the responses from Apache are encrypted, which makes it really hard to trace request IDs in the pcap files.

Most of the requests with delays have a response size of a few kilobytes to a few megabytes, though very rarely we also see slow requests with a response body of <1kb, which should fit into a single TCP packet.

Does anybody have a good hypothesis where this delay could come from, and how I could debug it?

Asked Mar 13 at 14:52 by moritz, edited Mar 13 at 14:58
  • What are the requests? What is the load at the time the delays are encountered (time of day)? Do the delays occur across multiple days? Can a delay be recreated (by replaying requests)? Are there any hardware issues, such as dodgy line cards? Is this hosted locally or with a cloud provider? Consider adding more debugging. – ticktalk Commented Mar 17 at 22:50
  • Are you able to get access to the payloads that were subject to delay? Are other requests at the same time being processed fine? Can you reproduce it in a lab environment somehow? – Mo_ Commented Mar 18 at 16:26
  • The nature of the requests doesn't really matter, since they are handled by yet another service. According to Prometheus, the load is never high. This happens on 4 different VMware VMs in two different clusters and locations, so a dodgy hardware problem is unlikely. The problem appears in QA too, but not reproducibly. Replaying requests doesn't reproduce the problem. Other requests seem to be processed fine (need to look in the logs to be sure). – moritz Commented Mar 19 at 13:49

1 Answer

Sporadic 30–300s response delays like these typically stem from slow or stalled client reads, or from rare TCP stalls caused by packet loss and retransmissions. Your aiohttp app logs its message when it hands the response to Gunicorn, whereas Apache writes its access log entry only once it has finished sending the response to the client. A slow or flaky client or network therefore makes Apache hold the connection open, which delays the access log entry. You can confirm this by checking socket states (ss, netstat), Apache's mod_status, and TCP retransmissions while a delay is in progress. The rest of the system is likely fine; this pattern is common when a small fraction of clients are slow or get interrupted.
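
One of the checks mentioned above, TCP retransmissions, can be sampled on Linux from the kernel's system-wide counters in /proc/net/snmp. The following is a minimal sketch, not a definitive diagnostic: the function names are made up for illustration, and the counters cover all TCP traffic on the host (including loopback), so they give only a coarse signal rather than per-connection detail.

```python
def parse_tcp_counters(snmp_text):
    """Parse the two 'Tcp:' lines of /proc/net/snmp (a header line and a
    value line) into a dict mapping counter names to integers."""
    tcp_lines = [
        line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")
    ]
    if len(tcp_lines) != 2:
        raise ValueError("expected one header line and one value line for 'Tcp:'")
    names, values = tcp_lines
    return {name: int(value) for name, value in zip(names[1:], values[1:])}


def retransmit_ratio(before, after):
    """Fraction of TCP segments sent between two samples that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent else 0.0


def read_tcp_counters(path="/proc/net/snmp"):
    """Read and parse the current counters (Linux only)."""
    with open(path) as handle:
        return parse_tcp_counters(handle.read())
```

Calling `read_tcp_counters()` twice, some seconds apart, and passing both results to `retransmit_ratio` gives the retransmission rate for that interval. A rate that spikes around the time of a delayed request would support the TCP stall hypothesis; a flat rate points more toward slow client reads.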

Tags: apache. Original question: What could cause strange delays while sending delays from a python aiohttp server (Stack Overflow)