javascript - Get information from a web page (title, pictures, heads, etc...) - Stack Overflow

On Facebook, when you add a link to your wall, it fetches the title, pictures and part of the text. I've seen this behavior on other websites where you can add links. How does it work? Does it have a name? Is there any JavaScript/jQuery extension that implements it?

And how is it possible that Facebook goes to another website and gets the HTML when it's supposedly forbidden to make a cross-site AJAX call?

Thanks.

asked Jan 24, 2011 at 12:19, edited Jan 24, 2011 at 12:27, by vtortola

Maybe this will be helpful: stackoverflow.com/questions/680562/… – Musa, commented Apr 21, 2016 at 6:42

5 Answers

Basic Methodology

When the fetch event is triggered (for example, on Facebook, when a URL is pasted in), you can use AJAX to request the URL*, then parse the returned data as you wish.

Parsing the data is the tricky bit, because websites follow such varying standards. Taking the text between the title tags is a good start, along with possibly searching for a META description (though these are used less and less as search engines evolve toward more sophisticated content-based search).

Failing that, you need some way of finding the most important text on the page and taking the first 100 chars or so, as well as finding the most prominent picture on the page.

This is not a trivial task. It is extremely complicated to derive semantics from such a fluid and varied set of data (a generic returned web page). For example, you might find the biggest image on the page, which is a good start, but how do you know it's not a background image? How do you know it's the image that best describes the page?

Good luck!

*If you can't directly AJAX third-party URLs, this can be done by requesting a page on your own server, which fetches the remote page server-side with some sort of HTTP request.
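To make the title and description step concrete, here is a minimal sketch in JavaScript of parsing HTML that has already been fetched (for instance by the server-side page mentioned in the footnote). The names extractPreview, visibleText and decodeEntities are made up for illustration. Regular expressions are a shortcut that will miss plenty of real-world markup, so treat this as a starting point rather than a finished parser.

```javascript
// Illustrative sketch only: all function names here are hypothetical.

// Decode the handful of entities that commonly show up in titles.
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" becomes "&lt;" rather than "<"
}

// Drop script/style blocks and all remaining tags, then collapse whitespace.
function visibleText(html) {
  const stripped = html
    .replace(/<(script|style)\b[\s\S]*?<\/\1\s*>/gi, ' ')
    .replace(/<[^>]+>/g, ' ');
  return decodeEntities(stripped).replace(/\s+/g, ' ').trim();
}

function extractPreview(html, maxLength = 100) {
  // 1. Text between the title tags.
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  const title = titleMatch
    ? decodeEntities(titleMatch[1]).replace(/\s+/g, ' ').trim()
    : '';

  // 2. META description, with the attributes in either order.
  let description = '';
  const metaTag = (html.match(/<meta\b[^>]*>/gi) || [])
    .find((tag) => /\sname\s*=\s*["']description["']/i.test(tag));
  if (metaTag) {
    const content = metaTag.match(/\scontent\s*=\s*(["'])([\s\S]*?)\1/i);
    if (content) description = decodeEntities(content[2]).trim();
  }

  // 3. Failing that, take the first maxLength characters of the body text.
  if (!description) {
    const bodyMatch = html.match(/<body[^>]*>([\s\S]*)<\/body>/i);
    description = visibleText(bodyMatch ? bodyMatch[1] : html).slice(0, maxLength);
  }

  return { title, description };
}
```

Calling extractPreview(html) on a fetched page returns an object with title and description fields. Deciding which text is actually the "most important" is still the hard part described above; the fallback here simply takes whatever comes first in the body.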

Some Extra Thoughts

If you grab an image from a remote server and 'hotlink' it on your site, be aware that many sites serve 'anti-hotlinking' replacement images when you try to display it, so it might be worth comparing the image your page requests with the image that was actually fetched, so you don't show anything nasty by accident.

A lot of title tags in the head will be generic and non-descriptive. It would be better to fetch the title of the article (assuming an article-type site) if one is available, as it will be more descriptive. Finding it is difficult, though!

If you are really smart, you might be able to piggyback off Google, for example (check their T&C though). If a user requests a certain URL, you can Google-search it behind the scenes and use the descriptive text Google returns as your preview text. If Google changes its markup significantly, though, this could break very quickly!

You can use a PHP server-side script to fetch the contents of any web page (look up web scraping). What Facebook does is make an AJAX call to a PHP server-side script, which uses a PHP function called

file_get_contents('http://somesite.com.au');

Now, once the file or web page has been pulled into your server-side script, you can filter the contents for anything in particular. E.g. Facebook's get-link feature will look for the title, img and meta property="description" parts of the file or web page via regular expressions, e.g. with PHP's preg_match() function.

This can be collected and then returned to your web page.
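The img part of that filtering can be sketched the same way. The example below is in JavaScript rather than PHP, extractImageUrls is a hypothetical name, and it assumes the HTML has already been fetched. It collects every img src and resolves relative paths against the page's own URL so the results can be displayed from another site.

```javascript
// Illustrative sketch: list candidate preview images found in fetched HTML.
function extractImageUrls(html, pageUrl) {
  const urls = [];
  const imgTag = /<img\b[^>]*>/gi;
  let match;
  while ((match = imgTag.exec(html)) !== null) {
    // Require whitespace before "src" so attributes like data-src are ignored.
    const src = match[0].match(/\ssrc\s*=\s*(["'])(.*?)\1/i);
    if (!src || !src[2]) continue; // no usable src attribute
    try {
      // Resolve relative paths such as "/img/a.png" against the page URL.
      const absolute = new URL(src[2], pageUrl).href;
      if (!urls.includes(absolute)) urls.push(absolute);
    } catch (err) {
      // Skip values that cannot be parsed as a URL.
    }
  }
  return urls;
}
```

This only returns candidates in document order. Choosing the one that best represents the page would still need extra heuristics, such as image dimensions.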

You may also want to consider adding extra functions that pare down the data, as scraping some pages may take longer than expected to return the information you want. E.g. filter out irrelevant stuff like JavaScript, CSS, irrelevant tags, huge images, etc. to make it run faster.

If you get this down pat, you could potentially be on your way to building a web search engine or, better yet, collecting data off sites like Yellow Pages, e.g. phone numbers, mailing addresses, etc.

Also you may want to look further into:

get_meta_tags('http://somesite.com.au');

:-)

There are several APIs that can provide this functionality. For example, PageMunch lets you pass in a URL and a callback, so you can do this from the client side or feed it through your own server:

http://www.pagemunch.com

An example response for the BBC website looks like:

{
  "inLanguage": "en",
  "schema": "http:\/\/schema.org\/WebPage",
  "type": "WebPage",
  "url": "http:\/\/www.bbc.co.uk\/",
  "name": "BBC - Homepage",
  "description": "Breaking news, sport, TV, radio and a whole lot more. The BBC informs, educates and entertains - wherever you are, whatever your age.",
  "image": "http:\/\/static.bbci.co.uk\/wwhomepage-3.5\/1.0.64\/img\/iphone.png",
  "keywords": [
    "BBC",
    "bbc.co.uk",
    "bbc.com",
    "Search",
    "British Broadcasting Corporation",
    "BBC iPlayer",
    "BBCi"
  ],
  "dateAccessed": "2013-02-11T23:25:40+00:00"
}
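A response shaped like the one above is easy to turn into the data for a link preview. The sketch below relies only on the field names visible in the example (url, name, description, image). toLinkPreview is a hypothetical helper, not part of any PageMunch client, and how you actually request the JSON is left out.

```javascript
// Illustrative sketch: map a JSON response like the example above to preview fields.
function toLinkPreview(responseText) {
  // In JSON, "\/" is just an escaped "/", so JSON.parse yields ordinary URLs.
  const data = JSON.parse(responseText);
  return {
    url: data.url,
    title: data.name || data.url,        // fall back to the URL if there is no name
    description: data.description || '',
    image: data.image || null,           // null when the response has no image
  };
}
```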

You can always just look at what is in the <title> tag. If you need this in JavaScript, it shouldn't be that hard. Once you have the data you can do:

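// $(data) drops <html>/<head>, leaving <title> at the top level where .find() misses it, so wrap the nodes first.
var title = $('<div>').append($.parseHTML(data)).find('title').text();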

The problem will be getting the data since I think most browsers will block you from making cross site ajax requests. You can get around this by having a service on your site which will act as a proxy and make the request for you. However, at that point you might as well parse out the title on the server. Since you didn't specify what your back-end language is, I won't bother to guess now.

It's not possible with pure JavaScript due to the cross-domain policy: client-side script can't read the contents of pages on other domains unless the other domain explicitly exposes a JSON service.

The trick is to send a server-side request (each server-side language has its own tools), parse the results using regular expressions or some other string-parsing technique, and then use this server-side code as a "proxy" for an AJAX call made on the fly when the link is posted.
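As one possible shape for that proxy, here is a rough sketch of the server-side half in JavaScript. It assumes Node.js 18 or later (for the global fetch), and the names parseTargetUrl and handlePreview are invented for this example. It is not wired to a web framework; the idea is that your AJAX endpoint would call handlePreview with the URL the user pasted and send back the returned status and body as JSON.

```javascript
// Illustrative sketch of the server-side "proxy" step. Assumes Node.js 18+.

// Accept only absolute http(s) URLs, so the proxy cannot be pointed at
// file: or other schemes.
function parseTargetUrl(raw) {
  try {
    const url = new URL(raw);
    return url.protocol === 'http:' || url.protocol === 'https:' ? url : null;
  } catch (err) {
    return null; // not a valid absolute URL
  }
}

// Fetch the remote page on the server and return a small JSON-friendly result.
// fetchText is injectable so the function can be tried without network access.
async function handlePreview(rawUrl, fetchText = async (u) => (await fetch(u)).text()) {
  const target = parseTargetUrl(rawUrl);
  if (!target) {
    return { status: 400, body: { error: 'url must be an absolute http(s) URL' } };
  }
  try {
    const html = await fetchText(target.href);
    const match = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
    return {
      status: 200,
      body: { url: target.href, title: match ? match[1].trim() : '' },
    };
  } catch (err) {
    return { status: 502, body: { error: 'could not fetch remote page' } };
  }
}
```

Because the browser only ever talks to your own server, the cross-domain restriction never comes into play. A real deployment would also want timeouts and response size limits, which are omitted here.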
