I'm using a Python script (mechanize) to log in to a proxy portal. The login succeeds; I can confirm that from the page returned by read().

However, after logging in I still can't reach the sites blocked by the proxy. Comparing the HTTP headers, Firefox sends Connection: keep-alive, while mechanize sends Connection: close. I tried to reproduce the Firefox headers exactly using browser.addheaders, but that didn't work either :(

After some digging, I found a couple of suggestions that the server closes the connection because mechanize can't fully emulate a browser: the page contains JavaScript, which mechanize doesn't support.

So, is there a way to make the server believe that mechanize is a browser with JS support, even though it isn't?

BTW, I don't need JS; I can log in successfully, as I mentioned above. And please don't suggest PhantomJS. I need a Python package to do the job, not a headless browser.

Update:

Firefox headers:

GET xxx HTTP/1.1
Host: xxx
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: DSLastAccess=1454082611
Connection: keep-alive


HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Set-Cookie: DSEPAgentInstalled=; path=/; expires=Tue, 31-Jan-2006 16:18:32 GMT; secure
Date: Fri, 29 Jan 2016 16:18:32 GMT
x-frame-options: SAMEORIGIN
Connection: Keep-Alive
Keep-Alive: timeout=15
Pragma: no-cache
Cache-Control: no-store
Expires: -1
Transfer-Encoding: chunked

Mechanize addheaders:

browser.addheaders = [('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
            ('Accept-Language', 'en-US,en;q=0.5'),
            ('Accept-Encoding', 'gzip, deflate'),
            ('Host', 'xxx'),
            ('Connection', 'keep-alive'),
            ('Cookie', 'DSLastAccess=1454082611'),
            ('User-agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0')]

Mechanize headers:

send: 'CONNECT xxx:443 HTTP/1.0\r\n'
send: '\r\n'
send: 'GET xxx.cgi HTTP/1.1\r\nAccept-Language: en-US,en;q=0.5\r\nAccept-Encoding: gzip, deflate\r\nHost: xxx\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0\r\nConnection: close\r\nCookie: DSLastAccess=1454082611\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Content-Type: text/html; charset=utf-8
header: Set-Cookie: DSEPAgentInstalled=; path=/; expires=Tue, 31-Jan-2006 16:31:03 GMT; secure
header: Date: Fri, 29 Jan 2016 16:31:03 GMT
header: x-frame-options: SAMEORIGIN
header: Connection: close
header: Pragma: no-cache
header: Cache-Control: no-store
header: Expires: -1
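
The two request dumps above can be compared mechanically rather than by eye. The following is a minimal sketch, not part of the original question: the helper names `parse_headers` and `diff_headers` are made up for illustration, and the header values are copied from the dumps. It compares header fields only, so it ignores the differing request line and the CONNECT tunnel that mechanize opens first.

```python
def parse_headers(raw):
    """Turn 'Name: value' lines into a dict keyed by lower-cased header name."""
    headers = {}
    for line in raw.strip().splitlines():
        # Split on the first colon only; values may contain colons themselves.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def diff_headers(left, right):
    """Return {name: (left_value, right_value)} for every header that differs."""
    names = sorted(set(left) | set(right))
    return {n: (left.get(n), right.get(n)) for n in names if left.get(n) != right.get(n)}


# Request headers as sent by Firefox (copied from the dump above).
FIREFOX_REQUEST = """\
Host: xxx
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: DSLastAccess=1454082611
Connection: keep-alive
"""

# Request headers as sent by mechanize (copied from the debug output above).
MECHANIZE_REQUEST = """\
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Host: xxx
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0
Connection: close
Cookie: DSLastAccess=1454082611
"""

print(diff_headers(parse_headers(FIREFOX_REQUEST), parse_headers(MECHANIZE_REQUEST)))
```

With these inputs the only header that differs is Connection, which supports the point that the User-Agent and the other fields are already being imitated correctly.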

Another thing that drives me crazy: the Connection header that mechanize actually sends is close, even though I set it to keep-alive in addheaders, as you can see above.
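
A possible explanation for the overridden header, offered as a sketch rather than a diagnosis: Python's standard `urllib.request` (urllib2 in Python 2) replaces any Connection header with `close` just before sending, because its response object cannot handle a persistent connection. Mechanize's HTTP handler follows the urllib2 design, so it is assumed here, not verified against the mechanize source, that the same thing happens there. The demo below uses only the standard library, not mechanize; the handler class and function names are made up for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoConnectionHandler(BaseHTTPRequestHandler):
    """Reply with the Connection header value the client actually sent."""

    def do_GET(self):
        body = (self.headers.get("Connection") or "").encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def connection_header_seen_by_server():
    """Ask for keep-alive and return what a local test server received."""
    server = HTTPServer(("127.0.0.1", 0), EchoConnectionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        request = urllib.request.Request(url, headers={"Connection": "keep-alive"})
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return response.read().decode("ascii")
    finally:
        server.shutdown()
        server.server_close()


print(connection_header_seen_by_server())
```

If mechanize behaves the same way, adding `('Connection', 'keep-alive')` to addheaders cannot change what goes on the wire, and the `Connection: close` in the server's reply is simply the server honouring the client's request rather than a sign that it detected a non-browser.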

asked Jan 27, 2016 at 14:41 by user5174680; edited Feb 1, 2016 at 23:04 by Mogsdad
  • There is nothing in HTTP headers about JS. Keep-alive is probably not relevant here. You should probably post the HTTP headers (both request and response) in both the working and the non-working version. Edit out the session cookie or whatever, but check whether it was there. – Sergey Salnikov Commented Jan 27, 2016 at 17:41
  • @SergeySalnikov, thanks for the reply. I'm not saying that there is something in the HTTP headers about JS. I'm just saying that from the HTTP headers I can tell that the server closes the connection. That is probably because the server can tell that mechanize is not a browser, and it can tell because it doesn't see support for JS. So it recognizes mechanize as NOT a browser – user5174680 Commented Jan 28, 2016 at 13:32
  • Do you mean the server closes the connection without any reply? – Sergey Salnikov Commented Jan 28, 2016 at 15:27
  • @SergeySalnikov, no, of course it replies. I mean that when I check the server's HTTP headers, they contain Connection: close – user5174680 Commented Jan 29, 2016 at 15:15
  • As far as I know, there's no way for an HTTP server to detect client JavaScript support. The most common way to identify a client is the User-Agent header. It would be great if you posted the request/response headers, as suggested by @SergeySalnikov – Miguel A. Baldi Hörlle Commented Jan 29, 2016 at 15:33

1 Answer

For Linux

First of all, I know some people don't want a suggestion to switch to another tool. However, I believe that if you want full access to the page after logging in (which currently fails due to the lack of JavaScript support), you should look into using Selenium.

You can grab it with a quick sudo pip install selenium.

Accessing a webpage is as easy as declaring your browser and then telling it to go to the desired page. Here is a basic sample that opens a webpage; the page I'm using relies heavily on JavaScript:

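from selenium import webdriver

browser = webdriver.Firefox()
try:
    # Selenium needs a full URL, scheme included
    browser.get('http://mikekus.com')
    # ... work with the page here ...
finally:
    # Always close the browser, even on Ctrl+C or an error
    browser.quit()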

This works because Selenium actually opens a browser. However, you may wish to hide the browser so that you don't have to see it or have it in your taskbar.

In that case I recommend the following setup using pyvirtualdisplay, which hides the browser via visible=0. It is worth noting that pyvirtualdisplay is a wrapper for Xvfb and so requires you to install that as well. You can get it with sudo apt-get install xvfb:

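from selenium import webdriver
from pyvirtualdisplay import Display

# Start an invisible X display (Xvfb) for the browser to draw on
display = Display(visible=0, size=(800, 600))
display.start()
try:
    browser = webdriver.Firefox()
    try:
        # Selenium needs a full URL, scheme included
        browser.get('http://mikekus.com')
        # ... work with the page here ...
    finally:
        # Close the browser before tearing down the display
        browser.quit()
finally:
    display.stop()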

I will leave filling in the login forms and so on to you, as it's quite simple if you read the docs, as everyone should: Navigating With Selenium

Granted, in your situation you are trying to access the proxy and then access another site. With this method you would direct the proxy to the target webpage from the proxy's own page, by filling in the fields on that page. I'm sure that with a bit of time and research you could go on to navigate multiple pages and page elements.
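
For orientation, the form-filling step could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `log_in` and the field names `username` and `password` are hypothetical, so inspect the real login page to find the actual names. The locator strategy is passed as the plain string `"name"`, which is the value behind Selenium's `By.NAME` constant.

```python
def log_in(driver, login_url, username, password,
           user_field="username", password_field="password"):
    """Open the login page, fill in both fields and submit the form.

    `driver` is any Selenium WebDriver, e.g. webdriver.Firefox().
    The default field names are hypothetical placeholders.
    """
    driver.get(login_url)
    # Locate each input by its HTML name attribute and type into it.
    driver.find_element("name", user_field).send_keys(username)
    password_box = driver.find_element("name", password_field)
    password_box.send_keys(password)
    # submit() sends the form that contains this element.
    password_box.submit()
    return driver.current_url
```

It would be called with a live driver, for example `log_in(browser, portal_url, user, secret)`, where `portal_url`, `user` and `secret` stand for your own values. After that, the same `browser` object keeps the session, so further pages can be requested through the portal.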

I hope this helps. Good luck.

Tags: python. Source: "How to emulate a browser with JavaScript support via Mechanize", Stack Overflow