We host a set of "resource" pages: a collection of useful links for our users. For years a script has run daily, looping through each link and sending one HEAD request with PHP's Guzzle to make sure each page on the resource sites is active.

Over the past few years, though, more and more sites return a 403 to the HEAD request (I suspect because more of them have adopted Cloudflare), to the point where the check is close to useless.

Is there a way to do this that won't get the traffic treated as malicious? I don't need the content from the other sites. I just need to know whether the pages are in good working order.

Here's the PHP code I'm using:

$client = new Client();
$request = $client->head($encoded_link);
$request->setOptions(['userAgent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36']);
$response = $request->send();

asked Mar 8 at 16:20 by Charlie Shehadi
  • I haven't tried this myself, but you could read the headers with a normal GET request and not read the body. See, for instance: chriswhite.blog/coding/… You still need to work out which headers you get in which situation. – KIKO Software Commented Mar 8 at 16:41
  • Agreed, just do a GET instead of a HEAD. It's not going to tell you that the site is "working" per se (i.e., a "down for maintenance" page is still going to return a success), but you're not getting that now and this will at least tell you if the server is actively responding. – Alex Howansky Commented Mar 8 at 16:53
  • Yes, using GET instead of HEAD does a much better job. – Charlie Shehadi Commented Mar 10 at 15:51

1 Answer (score: -1)

Several approaches can help here, and which one fits depends on your needs.

  1. If the number of resources you want to check is not too high, you might use a monitoring service such as UptimeRobot or Pingdom.

  2. For the most realistic approach, consider driving a headless browser through a PHP library such as chrome-php, php-webdriver or Symfony Panther, which interacts with sites the way a real browser does. It takes some work to set up, but it is the option least likely to be mistaken for an automated tool.

  3. Your script can be improved:

    1. Use GET instead of HEAD requests

      Many security systems are more suspicious of HEAD requests, since automated tools use them often and real users rarely do. Switching to GET requests might help. Note that in current Guzzle versions get() sends the request and returns the response directly:

      $response = $client->get($encoded_link);

    2. Improve your user agent string

      Your current user agent is somewhat outdated (Chrome 61). Use a more recent browser signature:

      $options = [
          'headers' => [
              'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
          ]
      ];
      $response = $client->get($encoded_link, $options);
      
    3. Add realistic headers

      Include headers that typical browsers would send:

      $options = [
          'headers' => [
              'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
              'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
              'Accept-Language' => 'en-US,en;q=0.9',
              'Accept-Encoding' => 'gzip, deflate, br',
              'Connection' => 'keep-alive',
              'Upgrade-Insecure-Requests' => '1',
              'Sec-Fetch-Dest' => 'document',
              'Sec-Fetch-Mode' => 'navigate',
              'Sec-Fetch-Site' => 'none',
              'Sec-Fetch-User' => '?1'
          ]
      ];
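
Putting the three script changes together, a minimal sketch might look like the one below. It assumes Guzzle 6 or 7 installed through Composer, with vendor/autoload.php already loaded. The names checkLink() and classifyStatus(), the 'ok' / 'blocked' / 'broken' labels and the timeout values are hypothetical choices for illustration, not part of Guzzle. The 'stream' option follows the suggestion in the comments on the question: it lets you read the status and headers without downloading the body.

```php
<?php
// Sketch only. Assumes Guzzle 6 or 7 is installed via Composer and that
// vendor/autoload.php has been required before checkLink() is called.

use GuzzleHttp\Client;
use GuzzleHttp\Exception\TransferException;

/**
 * Map an HTTP status code to a coarse link state.
 * The three labels are an illustrative choice, not a standard.
 */
function classifyStatus(int $status): string
{
    if ($status >= 200 && $status < 400) {
        return 'ok';
    }
    // 401/403/429 usually mean "refused or rate limited", not "page gone".
    if (in_array($status, [401, 403, 429], true)) {
        return 'blocked';
    }
    return 'broken';
}

/**
 * Send one GET request and classify the result without reading the body.
 *
 * @param array $headers Request headers, e.g. the browser-like set above.
 */
function checkLink(Client $client, string $url, array $headers = []): string
{
    try {
        $response = $client->request('GET', $url, [
            'stream'          => true,  // do not download the body up front
            'http_errors'     => false, // no exception on 4xx/5xx responses
            'connect_timeout' => 5,     // seconds, arbitrary value
            'timeout'         => 15,    // seconds, arbitrary value
            'headers'         => $headers,
        ]);
    } catch (TransferException $e) {
        // DNS failure, refused connection, timeout, TLS error and similar.
        return 'broken';
    }

    $status = $response->getStatusCode();
    $response->getBody()->close(); // discard the unread body
    return classifyStatus($status);
}
```

Usage would be along the lines of `$state = checkLink(new Client(), $encoded_link, $options['headers']);`. Reporting a 403 as 'blocked' rather than 'broken' keeps a bot-protection challenge from being logged as a dead link, so those URLs can be rechecked by hand or with a headless browser.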
      

Tags: guzzle. Source: "What is a legitimate way in PHP to test if a thirdparty site is working", Stack Overflow