On Facebook, when you add a link to your wall, it fetches the page's title, pictures, and part of the text. I've seen this behavior on other websites where you can add links. How does it work? Does it have a name? Is there a JavaScript/jQuery extension that implements it?
And how is it possible for Facebook to go to another website and get the HTML when cross-site AJAX calls are supposedly forbidden?
Thanks.
asked Jan 24, 2011 at 12:19 by vtortola (edited Jan 24, 2011 at 12:27)
Comment: This may be helpful: stackoverflow.com/questions/680562/… – Musa, Apr 21, 2016 at 6:42
5 Answers
Basic Methodology
When the fetch event is triggered (on Facebook, for example, by pasting in a URL), you can use AJAX to request the URL*, then parse the returned data as you wish.
Parsing the data is the tricky bit, because websites vary so much in how they mark up their content. Taking the text between the title tags is a good start, along with possibly searching for a META description (but these are being used less and less as search engines evolve toward more sophisticated content-based searches).
Failing that, you need some way of finding the most important text on the page and taking the first 100 characters or so, as well as finding the most prominent picture on the page.
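As a rough illustration of the title / META description / first-100-characters fallback described above, here is a minimal sketch in plain JavaScript. The function names `extractSummary` and `decodeBasicEntities` are made up for this example, and the regular expressions assume reasonably well-formed markup; a real HTML parser would be more robust.

```javascript
// Hypothetical helpers, not part of any library.
// Decodes only a handful of common entities.
function decodeBasicEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");
}

function extractSummary(html, maxLength = 100) {
  // 1. Text between the title tags.
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  const title = titleMatch ? decodeBasicEntities(titleMatch[1].trim()) : null;

  // 2. <meta name="description" content="...">, in either attribute order.
  let description = null;
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    if (/\bname\s*=\s*["']description["']/i.test(tag)) {
      const content = tag.match(/\bcontent\s*=\s*(["'])([\s\S]*?)\1/i);
      if (content) description = decodeBasicEntities(content[2].trim());
      break;
    }
  }

  // 3. Fallback: the first maxLength characters of the visible body text.
  if (!description) {
    const bodyMatch = html.match(/<body[^>]*>([\s\S]*)<\/body>/i);
    const text = (bodyMatch ? bodyMatch[1] : html)
      .replace(/<(script|style|title)\b[\s\S]*?<\/\1>/gi, " ")
      .replace(/<[^>]+>/g, " ")
      .replace(/\s+/g, " ")
      .trim();
    description = decodeBasicEntities(text).slice(0, maxLength) || null;
  }
  return { title, description };
}
```

Calling `extractSummary(html)` on a fetched page returns an object with `title` and `description` properties; either can be `null` when nothing usable is found.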
This is not a trivial task. It is extremely complicated to derive semantics from such a fluid and varied set of data (a generic returned web page). For example, you might find the biggest image on the page, which is a good start, but how do you know it's not a background image? How do you know it's the image that best describes the page?
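One possible starting point for the "biggest image" idea is sketched below. It only looks at `<img>` tags with declared `width` and `height` attributes, so CSS background images are never candidates, but it cannot tell whether the winner really describes the page. The names `readAttribute` and `pickLargestImage` and the `minSide` threshold are assumptions made for this example.

```javascript
// Hypothetical helper: read one quoted attribute value from a single tag string.
function readAttribute(tag, name) {
  const pattern = new RegExp("(?:^|\\s)" + name + "\\s*=\\s*([\"'])(.*?)\\1", "i");
  const match = tag.match(pattern);
  return match ? match[2] : null;
}

// Hypothetical heuristic: choose the <img> with the largest declared area.
function pickLargestImage(html, minSide = 50) {
  const tags = html.match(/<img\b[^>]*>/gi) || [];
  let best = null;
  for (const tag of tags) {
    const src = readAttribute(tag, "src");
    const width = parseInt(readAttribute(tag, "width"), 10);
    const height = parseInt(readAttribute(tag, "height"), 10);
    // Skip images without a source or declared size, and tiny icons or spacers.
    if (!src || !(width >= minSide) || !(height >= minSide)) continue;
    const area = width * height;
    if (!best || area > best.area) best = { src, width, height, area };
  }
  return best; // null when no image qualifies
}
```

Images whose size is set only through CSS are ignored here; handling them would mean downloading each image and measuring it, which is slower.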
Good luck!
*If you can't directly AJAX third-party URLs, this can be done by requesting a page on your own server, which fetches the remote page server-side with some sort of HTTP request.
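The server-side fetch mentioned in the footnote could look roughly like this in Node.js. This is a sketch under assumptions: `fetchRemoteHtml` is a made-up name, the default relies on the global `fetch` available in Node 18 and later, and the `fetchImpl` parameter exists only so the function can be exercised with a stub instead of a real network call.

```javascript
// Hypothetical server-side helper: the browser asks our own server,
// and the server requests the remote page on its behalf.
async function fetchRemoteHtml(targetUrl, fetchImpl = globalThis.fetch) {
  const parsed = new URL(targetUrl); // throws on malformed input
  // Only allow web URLs, so the endpoint cannot be pointed at other schemes.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error("Only http and https URLs are allowed");
  }
  const response = await fetchImpl(parsed.href);
  if (!response.ok) {
    throw new Error("Remote server answered with status " + response.status);
  }
  return response.text();
}
```

An endpoint on your own site would call this with the URL the user pasted, parse the result, and return only the extracted fields to the browser. A production version would also want timeouts and a size limit.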
Some Extra Thoughts
If you grab an image from a remote server and 'hotlink' it on your site, be aware that many sites serve an 'anti-hotlinking' replacement image when the image is displayed elsewhere. It might be worth comparing the image as displayed on your page with the image actually fetched by your server, so you don't show anything nasty by accident.
A lot of title tags in the head are generic and non-descriptive. It would be better to fetch the title of the article (assuming an article-type site) if one is available, as it will be more descriptive. Finding it is difficult, though!
If you are really smart, you might be able to piggyback off Google, for example (check their T&C though). If a user requests a certain URL, you can search Google for it behind the scenes and use the descriptive text Google returns as your return text. If Google changes their markup significantly, though, this could break very quickly!
You can use a PHP server-side script to fetch the contents of any web page (look up web scraping). What Facebook presumably does is send an AJAX call to a server-side script, which in PHP could use a function such as
file_get_contents('http://somesite.com.au');
Once the file or web page has been pulled into your server-side script, you can filter the contents for anything in particular. For example, Facebook's link fetcher will look for the title, img and meta name="description" parts of the file or web page via regular expressions,
e.g. with PHP's
preg_match() function.
This can be collected then returned back to your webpage.
You may also want to add extra functions that trim the data down to what you want, as scraping some pages may take longer than expected to return your desired information. For example, filter out irrelevant stuff like JavaScript, CSS, irrelevant tags, huge images, etc. to make it run faster.
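The filtering step could be sketched as follows, here in JavaScript rather than PHP. `stripNoise` is a hypothetical name, and the list of tags to drop is just one reasonable guess at what counts as irrelevant.

```javascript
// Hypothetical pre-filter: drop comments and script/style-type blocks
// before running any title, description, or image extraction.
function stripNoise(html) {
  return html
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<(script|style|noscript|iframe)\b[\s\S]*?<\/\1\s*>/gi, "");
}
```

Running this first keeps later regular expressions from matching text inside scripts or stylesheets and shrinks the string they have to scan.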
If you get this down pat, you could potentially be on your way to building a web search engine or, better yet, collecting data off sites like Yellow Pages, e.g. phone numbers, mailing addresses, etc.
Also you may want to look further into:
get_meta_tags('http://somesite.com.au');
:-)
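For anyone doing the parsing in JavaScript instead, something loosely similar to get_meta_tags can be sketched like this. `collectMetaTags` is a made-up name, it works on an HTML string that has already been fetched rather than on a URL, and it does not reproduce every detail of the PHP function.

```javascript
// Hypothetical helper: map each <meta name="..."> to its content value.
function collectMetaTags(html) {
  const result = {};
  const tags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of tags) {
    const name = tag.match(/\sname\s*=\s*(["'])(.*?)\1/i);
    const content = tag.match(/\scontent\s*=\s*(["'])([\s\S]*?)\1/i);
    // Tags without both attributes (e.g. <meta charset>) are skipped.
    if (name && content) result[name[2].toLowerCase()] = content[2];
  }
  return result;
}
```

Keys are lowercased so that `Description` and `description` end up in the same slot; a content value containing a literal `>` would defeat the simple tag pattern.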
There are several APIs that can provide this functionality. For example, PageMunch lets you pass in a URL and a callback, so that you can do this from the client side or feed it through your own server:
http://www.pagemunch.com
An example response for the BBC website looks like:
{
"inLanguage": "en",
"schema": "http:\/\/schema\/WebPage",
"type": "WebPage",
"url": "http:\/\/www.bbc.co.uk\/",
"name": "BBC - Homepage",
"description": "Breaking news, sport, TV, radio and a whole lot more. The BBC informs, educates and entertains - wherever you are, whatever your age.",
"image": "http:\/\/static.bbci.co.uk\/wwhomepage-3.5\/1.0.64\/img\/iphone.png",
"keywords": [
"BBC",
"bbc.co.uk",
"bbc.",
"Search",
"British Broadcasting Corporation",
"BBC iPlayer",
"BBCi"
],
"dateAccessed": "2013-02-11T23:25:40+00:00"
}
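You can always just look at what is in the title tag. If you need this in JavaScript, it shouldn't be that hard. Once you have the data you can do:
// Wrap first: when a full page string is parsed, title becomes a top-level node that .find() alone would miss.
var title = $('<div>').append($.parseHTML(data)).find('title').text();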
The problem will be getting the data, since browsers will generally block you from making cross-site AJAX requests. You can get around this by having a service on your site that acts as a proxy and makes the request for you. However, at that point you might as well parse out the title on the server. Since you didn't specify what your back-end language is, I won't bother to guess now.
It's not possible with pure JavaScript due to the cross-domain policy: client-side script can't read the contents of pages on other domains unless the other domain explicitly exposes a JSON service.
The trick is to send a server-side request (each server-side language has its own tools), parse the results using regular expressions or some other string-parsing technique, then use this server-side code as a "proxy" for an AJAX call made "on the fly" when the link is posted.
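On the client side, the call to such a proxy is an ordinary same-origin request. A minimal sketch follows, assuming a hypothetical `/preview` endpoint on your own server that takes the target address in a `url` query parameter; both names are invented for this example.

```javascript
// Hypothetical helper: build the same-origin request URL for the proxy.
// The pasted link must be encoded, or its own "?" and "&" would be
// read as part of the proxy's query string.
function buildPreviewRequestUrl(endpoint, pastedUrl) {
  return endpoint + "?url=" + encodeURIComponent(pastedUrl.trim());
}
```

The resulting string can then be passed to whatever AJAX call you already use, for example `$.getJSON(buildPreviewRequestUrl('/preview', link), callback)`.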
Source: javascript - Get information from a web page (title, pictures, heads, etc...) - Stack Overflow