<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title><![CDATA[Network and Computer Expert: Wen Lei (闻雷) - Internet Marketing Strategy]]></title>
<link>http://www.wenlei.net/</link>
<description><![CDATA[An explorer of Chinese e-commerce]]></description>
<language>zh-cn</language>
<copyright><![CDATA[Copyright 2006]]></copyright>
<webMaster><![CDATA[wenlei@vip.qq.com(闻雷)]]></webMaster>
<generator>wenlei</generator> 
<image>
	<title>Network and Computer Expert: Wen Lei (闻雷)</title> 
	<url>http://www.wenlei.net/images/logos.gif</url> 
	<link>http://www.wenlei.net/</link> 
	<description>Network and Computer Expert: Wen Lei (闻雷)</description> 
</image>

			<item>
			<link>http://www.wenlei.net/default.asp?id=292</link>
			<title><![CDATA[Internet Marketing Hinges on Customer Analysis]]></title>
			<author>wenlei@vip.qq.com(闻雷)</author>
			<category><![CDATA[Internet Marketing Strategy]]></category>
			<pubDate>Tue, 20 Jan 2009 11:22:28 +0800</pubDate>
			<guid>http://www.wenlei.net/default.asp?id=292</guid>	
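		<description><![CDATA[<p align="left">Internet marketing should center on customer analysis: only by understanding your customers can you aim your marketing precisely. When analyzing customers, concentrate on the following points: <br /><br />1. Who is the customer? You need a simple, unambiguous way to identify customers, so that in any situation you can tell quickly whether the person in front of you is one. <br /><br /><br />&nbsp;&nbsp;&nbsp;A customer is usually someone who uses the product or service, or who is interested in it. Customer profiles are commonly built from age, gender, occupation, education and income, but this kind of identification is not always reliable. Some products are optional for everyone. Take cola: as a carbonated drink, anyone (apart from special groups such as diabetics) can drink it, yet nobody has to. How, then, do you identify the customers for cola? You could run market surveys to find out which groups consume cola-type products and build your marketing strategy on the results. If you do, you will probably never become a company like Coca-Cola, nor create a globally popular product like jeans. For goods of this kind, many successful companies use advertising to steer customer demand, so that advertising drives demand. A product that people can take or leave has to be sold by persuading and motivating customers again and again until they buy. The health-supplement industry seems to have plenty of experience here. However worthless the product, a round of wild hype leaves consumers dazed; they buy in droves and spend without thinking, and the regulators have had to issue one notice after another to stop the false advertising. <br /><br /><img src="http://www.wenlei.net/attachments/month_0901/w2009120112915.jpg" align="middle" alt="" /><br /><strong>Caption:</strong> <span id="_spnPicNote">Huang Yong (黄勇) sharing his experience of training business talent at the 2007 Western Business Talent Forum</span><br /><br />2. How do customers choose a product? Different customers may choose the same product for different reasons. Whatever the reason, it usually stays within the product's functions; in more academic terms, it usually does not go beyond the product's use value. Some customers care about price, some about quality, some about brand, some about performance. Before drawing up a formal marketing plan, we must work out what use value our product has. Does its price match that use value? How closely? Do users accept the match? Which aspects of the product do users care about most? Can our product meet those needs? If it cannot, how can we make it do so, or change the ratio of use value to price until it fits the match users have in mind? Take kitchen cabinets as an example from home furnishing. Different groups of buyers emphasize different things. Ordinary wage earners and pragmatically minded buyers, for instance, choose cabinets with melamine panels, which are cheap, resist acids and alkalis well and are easy to clean. Besides price, customers base their choice on the panels' resistance to acids and alkalis, sunlight, and grease and grime; the countertop's resistance to seepage; the panel design; the color scheme; environmental performance; the built-in appliances; fittings such as sinks and dish racks; and the level of technology. A marketing plan should establish how most customers weigh these aspects and target its marketing accordingly. <br /><br />3. How do customers buy the product? This is a question of sales channels. Marketers must know which channels customers buy through and what difficulties they often meet when buying. What do the channels themselves look for when they choose goods? How can the obstacles customers meet when buying be reduced? How can marketing information reach customers most directly? Suppose you sell cast-iron woks and stew pots. Products like these are usually retailed through farmers' markets and supermarkets, and those retailers buy from wholesalers. As the manufacturer, you should therefore concentrate your sales effort on wholesalers and retailers. For a new product, consider advertising to end consumers, so that the demand you awaken among them pulls purchases through the middlemen. <br /><br /><br />4. How do customers use the Internet? Since we are doing internet marketing, we must know where customers go online, which sites they visit often, which topics or sections of those sites they read, and what marketing resources those sites offer. Which of them relate to your product, and which suit promoting it? Which charge a fee, and which are free? The answers become an important basis for your internet marketing strategy. <br /><br />Throughout all four analyses, keep asking what the customer cares about most, so that your marketing is focused and well targeted. <br /><br />Author: Huang Yong (黄勇),&nbsp;Director of the E-Commerce Teaching and Research Office,&nbsp;Department of Management, College of Engineering and Technology, Chengdu University of Technology,&nbsp;2008.12.17&nbsp;</p>]]></description>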
		</item>
		
			<item>
			<link>http://www.wenlei.net/default.asp?id=285</link>
			<title><![CDATA[Looking Back on 2008: Internet Marketing]]></title>
			<author>wenlei@vip.qq.com(闻雷)</author>
			<category><![CDATA[Internet Marketing Strategy]]></category>
			<pubDate>Sat, 03 Jan 2009 14:31:10 +0800</pubDate>
			<guid>http://www.wenlei.net/default.asp?id=285</guid>	
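		<description><![CDATA[What is internet marketing? It is online marketing and brand promotion that dazzles people, makes them linger, captivates them and changes their minds.<br /><br />Obama beat Hillary Clinton with the Internet behind every move. Against McCain, every move again involved the Internet: email, mobile text messages, video sites and more. On Obama's campaign site, visitors could watch his videos directly, buy Obama-branded products online and download his speeches as ringtones. Obama played the Internet like a synchronized swimming routine, leading the mouse in every netizen's hand.<br /><br />Internet marketing is certainly not "Baidu + Alibaba", not a handful of pop-up ads and not "BBS hype". It is the integration of a whole series of online promotion techniques. It is a combination of punches, a synchronized swimming routine that needs long, graceful legs and a beautiful set of movements to "seduce" the audience.<br /><br />A portfolio spreads risk, so respect the 80/20 rule, but do not ignore the long-tail effect either. Search engines, BBS, video and e-magazines are the "20" that does the work of the "80", yet many long-tail channels also deserve attention, such as blogs, SNS and Baidu Tieba.<br />Search engines once played a prominent part in online promotion, but as paid ranking became transparent, users started to look more carefully. BBS marketing, by comparison, is less visible and less off-putting. Two "100 million yuan" stories from this year have to be mentioned. The "boycott" of Wanglaoji (王老吉) after its 100-million-yuan donation is a fairly successful BBS marketing case. Pangu Plaza (盘古大观), on the other hand, fabricated the news story "Gates rents a building for 100 million yuan to watch the Olympics" and hyped it on the major forums; awareness went up, but its reputation is now on a par with Naobaijin (脑白金).<br /><br />"A Murder Caused by a Steamed Bun" (《一个馒头引发的血案》) gave "The Promise" (《无极》) the greatest possible promotion. Not long ago I saw a video on PCauto (太平洋汽车网), "Hitler Rants About the Magotan Launch", which uses the famous figure of Hitler to say the opposite of what it means. It is brilliant, and it has now been viewed nearly 200,000 times. In B2C, 七星购物 is far less well known than Dangdang (当当) and Joyo (卓越), and even trails Jingdong Mall (京东商城) and Redbaby (红孩子) slightly, yet this year it sharply raised the attention its products receive by selling them through video. Even "video chat with pretty girls", once suspected of being pornographic, has been put to successful use: female shopping guides use video chat rooms to show products to users directly or to offer guided shopping by video, and the response has been unprecedented.<br /><br />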
<p align="center"><img alt="" src="http://www.wenlei.net/attachments/month_0901/h200913142940.jpg" /></p>
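<br /><br />As the saying goes, "whoever wins the point of sale wins the market". Research shows that persuasion by sales assistants at the point of sale often makes customers change what they originally meant to buy. "Video marketing" acts like a sales assistant at the online point of sale: through direct introduction and indirect persuasion it works brand and product information in skillfully. A vivid video can firm up a hesitant need and create a need where there was none.<br /><br />Long-tail channels such as blogs, online circles, SNS, Baidu Zhidao (百度知道) and Tieba (贴吧) are also releasing astonishing energy in internet marketing. Only by using these long tails well can you master the "synchronized swimming". The blogs and personal spaces of opinion leaders in particular have almost become "personal media" that influence the people around them. Communities such as Baidu Zhidao, Sina iAsk (新浪爱问) and Tianya Wenda (天涯问答) offer a service that never really appeared in the West. Brands such as KFC have reportedly begun sponsoring certain topics to communicate with target customers, and this popular field deserves further exploration. The newly risen Kaixin (开心网) is gaining users at a geometric rate, and a new battlefield for internet marketing is about to be born. Its "buy friends" and "rent a parking space" ideas are so refreshing that even the portal Sina has had to imitate them.<br /><br />Keep interacting: even 415,000 friends are not too many<br /><br />The greatest contribution of Web 2.0 is interaction, which is also its biggest difference from traditional media. It connects people through relationships so that they share views and values and communicate with one another, producing word of mouth of the "let everyone tell everyone" and "one tells ten, ten tell a hundred" kind.<br /><br />In the Web 2.0 era everyone has become a transmitter and producer of information. Viral marketing, which is more open, faster and more credible than pyramid selling, is quietly "eroding" every netizen. Before you know it, everyone is a "virus carrier". In the free-speech space of the Internet, everyone can find like-minded comrades and discuss the topics they care about tirelessly. That raises a question: how do you steer netizens toward discussions that favor the brand?<br /><br />When you first saw the "Gates rents a building for 100 million yuan" story, you may have taken it as a joke, perhaps a netizen's spoof. But once news items and BBS posts blanketed the major sites and kept plucking at your nerves, your psychological defenses had already collapsed. In the Master Kong (康师傅) "water source" scandal, the company's slow reaction gave competitors an opening. Doubts about Master Kong filled the major forums; had Liu Xiang's withdrawal from his race not stolen the Internet's attention, Master Kong's position might have been even worse.<br /><br />The "South China tiger" affair, which caused an uproar from the end of 2007, looks like a piece of "city PR". It does not look like a carefully planned city promotion, but it is an undeniable fact that the power of the Internet made Zhenping County in Shaanxi suddenly famous. As the "tiger photos" spread further, the county's slogan "Tour the natural heart of the country, hear the South China tiger roar, taste Zhenping cured pork" really did become its "tourism calling card".<br /><br />In a similar case, on 28 July 2008 a sculpted head of Jack Ma (马云) became one of the entries in a Hangzhou sculpture exhibition. Overnight the event became a hot topic online. In fact the work, titled "Idol: Ma Yun" (《偶像&mdash;马云》), had not been approved by Ma. Hangzhou nevertheless gained plenty of exposure and pulled off a successful piece of "city PR".<br /><br />A little leverage moves a great weight: every yuan saved counts<br /><br />Some say that the part of the human brain that stores memories is only 6 inches wide, about as wide as a hamburger. Cyberspace is huge and online resources are inexhaustible, but in practice internet marketing is about occupying those "6 inches" in the netizen's head.<br /><br />Mao Zedong said, "In the ideological field, if Marxism does not occupy the ground, non-Marxist things will." The same holds in internet marketing. Space in the brain is limited, so on the six-inch battlefield, taking a position quickly matters a great deal and takes real skill.<br /><br />A company's finance department is not a money-printing machine, and brand building is not spending for fun. More often you need finesse, the "four ounces that deflect a thousand pounds", and every yuan saved counts. Consider how much the "Wanglaoji boycott" campaign cost. How much can it cost to produce an entertaining video and keep placing it online? Internet marketing does not mean throwing money at advertising, so budget carefully and make every cent return two cents, or even ten.<br />... Obama was likewise able to raise the huge sums his campaign needed.<br /><br />With the rapid development of BBS, SNS, IM, RSS, blog and SEO technologies, the way users experience the web is changing enormously. Faced with so many internet marketing methods, a good company should not wait and watch. It should decide boldly, move in decisively and quickly earn a low-cost, high-return payoff from internet marketing. Lightning speed and the courage to go first are the lessons many brands draw from their online successes. KFC has already begun moving into Baidu Zhidao, so what are you waiting for?<br /><br />The online battle for the US presidency is not over yet, and Obama, the "master of internet marketing", is preparing hard for the final contest. Whether or not he eventually enters the White House, he leaves us rare experience in brand promotion and confidence in Web 2.0.<br /><br />Web 2.0 is of course a double-edged sword, and success and failure may be only a moment apart. In US presidential primaries, the Democratic candidate Howard Dean and the Republican legislator Burns were knocked out because an accidental slip was exaggerated and amplified online. In China in 2008, the nude photo scandal (艳照门), the Vanke donation scandal, the Master Kong water source scandal, the Microsoft "Tomato" scandal (蕃茄门) and other negative cases kept sounding the alarm. Will you point the sword at the world or turn it on yourself? It may come down to a single thought.<br /><br />Are you ready to seize the 6-inch main battlefield?]]></description>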
		</item>
		
			<item>
			<link>http://www.wenlei.net/default.asp?id=197</link>
			<title><![CDATA[Google PR Values Updated Today! What Is Google PR, and How Do You Raise It?]]></title>
			<author>wenlei@vip.qq.com(闻雷)</author>
			<category><![CDATA[Internet Marketing Strategy]]></category>
			<pubDate>Sat, 27 Oct 2007 10:53:40 +0800</pubDate>
			<guid>http://www.wenlei.net/default.asp?id=197</guid>	
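		<description><![CDATA[Today is 27 October 2007. Google has updated its PR values again, and this blog has risen from PR=1 to PR=3. Congratulations to us!<br /><br />Here is a detailed introduction to Google PR:<br /><br />Another reason Google is so popular is the speed at which it indexes sites. From submitting your site to seeing it included in Google generally takes only two weeks. Once your site is included, Google will usually crawl it and update (re-index) its information once a month. For sites with a higher PR (PageRank) value, the indexing cycle is correspondingly shorter. <br /><br />Google's index and re-index cycle is shorter than that of most search engines. This lets webmasters edit page properties such as the page title, the first few lines of text, the headings, the keyword distribution and, of course, the number of external links, and then find out quickly whether the changes worked. <br /><br />Because Google is so popular, you need to know how its search engine works. If you do not know how it decides your ranking, sites that are only slightly familiar with Google's ranking algorithm will rank ahead of you. Let us look at that algorithm now. <br /><br />Google's ranking algorithm has two main parts. The first is its text-matching system, which Google uses to find pages relevant to the terms the searcher typed. The second, and by far the most important, is Google's patented PageRank technology. <br /><br />I will first explain how to make a site relevant, which is the text-matching part of the algorithm: <br /><br />When matching a site against keywords, Google gives more weight to keywords that appear in the title tag (meta title). So make sure your title tag contains your most important keywords; in other words, build the page title around them. The title should not be too long, ideally 35 to 40 characters. <br /><br />As is well known, Google does not use meta tags such as the keywords or description tags. The text in these tags is not seen by actual visitors, and Google believes that some webmasters use them fraudulently, stuffing in popular keywords that have nothing to do with the site in order to rank for those keywords and win visitors by unfair means. <br /><br />Because meta tags are not supported, Google generates its description of a site from the first few lines of text on a page. So put your keywords or key phrases near the top of the page; if Google finds them there, your site's relevance rises accordingly. If Google finds no such content, you will have to work much harder to make the rest of the page relevant. <br /><br />When judging a site's relevance, Google also considers the keyword density of the body text, so make sure your keywords and key phrases appear several times throughout the page. But remember that too much is as bad as too little: a keyword density of 6-10% is best. <br /><br />Other ways to increase relevance include placing keywords in headings and, where possible, setting keywords in the text in bold. Google now also indexes the ALT text of images and counts it toward relevance, so include keywords in your ALT attributes to raise your relevance score. <br /><br />The last relevance technique is to make the text of the links that point to your site contain your keywords. Keywords in this link text raise your relevance score effectively (in its description of PageRank, Google also mentions that it analyzes a site's external links when computing page rank and counts them toward relevance). <br /><br />How many keywords should the link text contain? Opinions differ. I have noticed, though, that many sites already supply the link text they want in their link-exchange section, for example: "Link exchanges are welcome; please use the following code to link to this site." <br /><br />So far we have covered how Google computes relevance and how to increase it. But what does Google use to measure how good a site is? The answer is Google's PageRank system. <br /><br />PageRank takes its name from Google's founder Larry Page. It is one part of Google's ranking algorithm (ranking formula) and expresses the rank, or importance, of a page. As shown in the Google Toolbar, values run from 0 to 10, with 10 the top score. The higher the PR, the more popular (the more important) the page. For example, a site with PR 1 is not very popular, while a PR of 7 to 10 indicates a very popular (or extremely important) site. <br /><br />When ranking a site, PageRank takes the number of external links pointing to it into account. We can say that the more such links a site has, the higher its PR; and the higher the rank of the linking sites (say Macromedia's site links to yours), the higher its PR. For example, if XYZ.COM carries a link to ABC.COM, then ABC.COM presumably offers good content, and Google treats the link from XYZ.COM as a vote for ABC.COM. You can download and install the Google Toolbar to check your site's rank (PR value). <br /><br />Does this mean that the more external links (votes) a site has, the more important it is, and the higher it ranks for related keywords? That would be a serious mistake. <br /><br />The weight Google gives to external links does not mean you should link up with any site at all without a strategy, because Google does not decide rank simply by counting links. If it did, webmasters would have only one job left: exchanging links frantically to collect as many as possible. Google puts it this way: "Google does not just look at the number of votes a site receives, or the number of its external links. It also analyzes the sites that cast the votes. If those sites have a high PR (are important), the sites they vote for benefit (and become important too)." <br /><br />So is it true that the more high-quality, high-PR external links a site has, the better? Not entirely. <br /><br />It is wrong in that Google's PageRank system considers not only the quality of a linking site but also how many links that site gives out. Suppose site X has some PR. If your site Y is the only external link on X, Google concludes that X regards Y as its best external link and gives Y more credit. If X already carries 49 external links, Google concludes that X regards your site only as its 50th best. The more external links the linking site carries, the less PR you receive from it; the two are inversely related. <br /><br />It is right in that a linking site with PR 6 or above can normally raise your PR noticeably. But if that site already has 100 other external links, the PR you receive is close to zero. Likewise, if a linking site has a PR of only 2 but you are its only external link, you receive far more PR than from the PR 6 site with 100 external links. <br /><br />The question now looks more and more complicated. Never mind: the formula below should make everything clear, and it needs only a little mathematics. <br /><br />First, the damping factor. The damping factor is the share of a page's PR that it can actually pass on when it votes for (links to) other sites. It is generally taken to be 0.85, so what a page passes on is somewhat less than its own PR. Here is the formula for the PR score: <br /><br />PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn)) <br /><br />Here PR(A) is the PR score of your page A under the PageRank system; t1 to tn are the pages that link to A; PR(t1) is the PR score of linking page t1; C(t1) is the number of outbound links on t1; and d is the damping factor. Keep in mind that a site's voting weight is only 0.85 of its PR score, and that this weight is divided equally among all the sites it links to. <br /><br />Imagine a site called akamarketing.com that is linked from XYZ.COM, a site with PR 4 and 10 outbound links in total (9 others plus the link to akamarketing.com). The calculation is: <br /><br />PR(AKA) = (1-0.85) + 0.85*(4/10) <br /><br />PR(AKA) = 0.15 + 0.85*(0.4) <br /><br />PR(AKA) = 0.15 + 0.34 <br /><br />PR(AKA) = 0.49 <br /><br />In other words, a link from a PR 4 site with 10 outbound links in total gives my site a PR score of 0.49. <br /><br />Now suppose my site is linked from a site with PR 8 and 16 outbound links in total. The PR score I receive is: <br /><br />PR(AKA) = (1-0.85) + 0.85*(8/16) <br /><br />PR(AKA) = 0.15 + 0.85*(0.5) <br /><br />PR(AKA) = 0.15 + 0.425 <br /><br />PR(AKA) = 0.575 <br /><br />The two examples show that the PR of the linking site matters, but so does the number of outbound links it carries. <br /><br />You need not memorize the formula. Just remember that when building links to your site you should look for sites with a high PR and few outbound links. The more such sites link to you, the higher your PR, and the more your ranking improves. <br /><br />Still, the single most useful thing you can do for your PR is to submit your site to DMOZ, that is, to get it listed in the ODP (Open Directory Project). <br /><br />As is well known, Google's PageRank system thinks highly of portal directories such as DMOZ, Yahoo and Looksmart, and of DMOZ above all. To PageRank, a DMOZ link to a site is worth its weight in gold; the PR of the DMOZ category page that lists the site hardly matters. I have seen sites whose standing shot up, with an immediate rise in their Google PR, simply because the ODP had listed them. The reason is that Google uses its own copy of the ODP as its web directory. <br /><br />An ODP link is very important for PageRank. If the ODP lists your site, your page rank improves effectively. Not convinced? <br /><br />It is true. Search Google for any term and you will find that 7 or 8 of the first 10 results also appear in Google's directory. That fact alone shows that a site not listed in the ODP cannot expect much traffic from Google. <br /><br />Submitting your site to the ODP and getting it listed is not hard; it just takes some time. Make sure your site offers good content, click "Add URL" in the appropriate ODP category and follow the prompts step by step. At the very least, make sure your index page gets listed. I say "at the very least" because, although the ODP states that it lists only the index page, plenty of sites have 5 to 10 pages listed. So if your site covers several quite different subjects, you can submit the page for each subject to the ODP separately, but remember that haste makes waste. Once Google has refreshed its directory, you will see how your PR has changed. <br /><br />If Yahoo and Looksmart list your site, your PR will rise noticeably. On Yahoo submission, if you have time you can read the article "Yahoo Site Submission Tips". <br /><br />If your site is non-commercial, or almost entirely so, you can get it into the well-known web directory Looksmart through <a href="http://www.zeal.com/" target="_blank">www.Zeal.com</a>. I am personally very fond of ZEAL.COM: just as Google draws search results from DMOZ, Looksmart draws its non-commercial listings from the Zeal directory.<br />]]></description>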
		</item>
		
			<item>
			<link>http://www.wenlei.net/default.asp?id=186</link>
			<title><![CDATA[Web Crawlers (Spiders) - spider - crawler]]></title>
			<author>wenlei@vip.qq.com(闻雷)</author>
			<category><![CDATA[Internet Marketing Strategy]]></category>
			<pubDate>Sun, 23 Sep 2007 07:53:24 +0800</pubDate>
			<guid>http://www.wenlei.net/default.asp?id=186</guid>	
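		<description><![CDATA[<div class="t_msgfont"><font color="#333399" size="2">Definition of a web crawler<br /><br />A web crawler is also called a web robot or web spider; its English names include Spider, Crawler, Bots, Robot, Wanderer and Hotbot. [15] defines it in both a narrow and a broad sense. In the narrow sense, a spider is a software program that uses the HTTP protocol to traverse the Web's information space by following hyperlinks and retrieving hypertext documents. In the broad sense, a spider is any software program that retrieves web documents automatically over standard HTTP. The world's first "robot" program for measuring the growth of the Internet was the World Wide Web Wanderer, developed by Matthew Gray. At first it only counted the servers on the Internet; later it became able to retrieve site domain names. In contrast to the Wanderer, Martijn Koster created ALIWEB [16] in October 1993. ALIWEB did not use a "robot" program; it built its link index from information that sites submitted themselves, much like the Yahoo we know today. By the end of 1993, search engines based on this principle began to appear, the best known being JumpStation [17], the World Wide Web Worm [18] and the Repository Based Software Engineering (RBSE) spider [19]. JumpStation and the WWW Worm, however, simply listed results in the order in which the search tool found matches in its database, so the ordering said nothing about relevance. RBSE was the first engine to rank results by how well they matched the keyword string [20]. Crawler then appeared in 1994. The earliest full-text-indexing crawler is given as the 1994 Repository Based Software Engineering (RBSE) spider, after which spiders sprang up everywhere.<br /><br />What crawlers are used for, and the related protocols<br /><br />Spiders are reported to have five main applications: <br />Personal search: fetching pages on a given topic <br />Page collection: serving search engines and similar systems <br />Web statistics: counting the hosts on the network, the pages per host and so on <br />Site maintenance: checking for dead links <br />Web archiving: serving archives and similar institutions in specific fields. <br />Spiders are a mixed blessing. When a spider visits a site and its pages it adds load to the site's servers and to the network, and, more seriously, it may spread content that the site owner never meant to publish. The purpose of the robots.txt protocol is to tell a spider explicitly which parts of a site it may fetch and which it may not. Robots.txt is a gentlemen's agreement [21]: it depends entirely on spider operators complying voluntarily and cannot be enforced by law. Robot exclusion currently takes two forms. The first is a meta declaration added to the page markup. The meta element is normally placed between the &lt;head&gt; and &lt;/head&gt; tags of the HTML, and its content attribute accepts the following values:</font></div>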
<font size="3"><font color="#ff0000">
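<div class="t_msgfont"><br /><font color="#333399" size="2">QUOTE:<br />Index: the page may be fetched and indexed;<br />Follow: the links on the page may be followed;<br />Noindex: the page must not be indexed;<br />Nofollow: the links on the page must not be followed.</font></div>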
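The four values listed above can be read back out of a page with Python's standard html.parser module. This is only a minimal sketch: the class name RobotsMetaParser, the helper robots_meta_policy and the sample page are made up for illustration, and the sketch assumes that a page with no robots meta tag may be both indexed and followed.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list such as "noindex, follow".
        for value in (attributes.get("content") or "").split(","):
            value = value.strip().lower()
            if value:
                self.directives.add(value)


def robots_meta_policy(html_text):
    """Return (may_index, may_follow); both default to True (an assumption)."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


# Hypothetical page used only for this example.
page = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
print(robots_meta_policy(page))  # (False, True)
```

A crawler would call robots_meta_policy on each downloaded page before adding it to its index or queueing its links.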
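<div class="t_msgfont"><font color="#333399" size="2">If index and noindex form group A and follow and nofollow form group B, the rule is that only one value from each group may be used, while values from A and B may be combined freely. The meta tag is convenient but limited: some sites want to refuse or welcome only particular spiders, and writing the tag into every page is a lot of work. That is the problem the other gentlemen's agreement, robots.txt, solves. The protocol specifies a file named robots.txt in the root directory of the site. If your site's domain is </font><a href="http://www.example.com/"><font color="#333399" size="2">http://www.example.com/</font></a><font color="#333399" size="2">, the file lives at </font><a href="http://www.example.com/robots.txt"><font color="#333399" size="2">http://www.example.com/robots.txt</font></a><font color="#333399" size="2">; if the domain is </font><a href="http://example.com/"><font color="#333399" size="2">http://example.com/</font></a><font color="#333399" size="2">, the file lives at http://example.com/robots.txt. The file format is as follows: # starts a comment; User-agent: takes either * (meaning all spiders) or the name of one spider, such as badspider; Disallow: takes / (meaning all directories) or some other directory or file. For example:</font></div>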
<div class="t_msgfont"><br /><font color="#333399" size="2">QUOTE:<br /># </font><a href="http://www.example.com/"><font color="#333399" size="2">http://www.example.com/</font></a><font color="#333399" size="2"> site robots.txt<br />User-agent: *<br />Disallow: /cyberworld/map/ # a dead-link directory<br />User-agent: cybermapper<br />Disallow:</font></div>
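The quoted file can be checked with urllib.robotparser from the Python standard library. The sketch below feeds the same rules to the parser as an in-memory string, so no network access is needed; the helper name allowed and the spider name otherbot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted above, held in memory instead of fetched.
ROBOTS_TXT = """\
# http://www.example.com/ site robots.txt
User-agent: *
Disallow: /cyberworld/map/ # a dead-link directory

User-agent: cybermapper
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Ask the parsed rules whether the named spider may fetch the path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("otherbot", "/cyberworld/map/index.html"))     # False
print(allowed("otherbot", "/index.html"))                    # True
print(allowed("cybermapper", "/cyberworld/map/index.html"))  # True
```

An empty Disallow value means nothing is withheld, which is why cybermapper may fetch the directory that every other spider has to skip.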
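<div class="t_msgfont"><font color="#333399" size="2">The file above keeps /cyberworld/map/ out of the index for all spiders, while the spider named cybermapper is allowed to visit every directory. The net result is that nothing is withheld from cybermapper and one directory is withheld from everyone else. Each Disallow line can name at most one directory; several directories need several lines, for example:</font></div>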
<div class="t_msgfont"><br /><font color="#333399" size="2">QUOTE:<br /># </font><a href="http://www.example.com/"><font color="#333399" size="2">http://www.example.com/</font></a><font color="#333399" size="2"> site robots.txt<br />User-agent: *<br />Disallow: /cyberworld/map0/ # a dead-link directory<br />User-agent: *<br />Disallow: /cyberworld/map1/ # a dead-link directory<br />User-agent: *<br />Disallow: /cyberworld/map2/ # a dead-link directory<br /></font><br /><strong>Spiders as described on Wikipedia</strong><br /><br />A web crawler (also known as a web spider or web robot) is a program which browses the World Wide Web in a methodical, automated manner. Other less frequently used names for web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and Takeda, 2000). Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam). A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.<br /><br /><strong>Crawling policies</strong><br /><br />There are two important characteristics of the Web that make web crawling very difficult: its large volume and its rate of change, as a huge number of pages are added, changed and removed every day. Also, network speed has improved less than current processing speeds and storage capacities. The large volume implies that the crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads. 
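<br /><br />The seed list and crawl frontier described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function name crawl, the max_pages limit and the tiny in-memory LINK_GRAPH that stands in for real page downloads. The frontier is a plain first-in, first-out queue, which gives a breadth-first visiting order; a real crawler would replace it with whatever prioritization its policies call for.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINK_GRAPH = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seeds, fetch_links, max_pages):
    """Visit pages breadth-first, starting from the seeds, up to max_pages."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, so none is queued twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://example.com/"], lambda url: LINK_GRAPH.get(url, []), max_pages=10)
print(order)
```

The max_pages limit reflects the point made above that a crawler can only download a fraction of the Web in a given time.<br /><br />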
The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted. As Edwards et al. note, &quot;Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in a not only scalable, but efficient way if some reasonable measure of quality or freshness is to be maintained.&quot; (Edwards et al., 2001). A crawler must carefully choose at each step which pages to visit next. The behavior of a web crawler is the outcome of a combination of policies:<br />A selection policy that states which pages to download.<br />A re-visit policy that states when to check for changes to the pages.<br />A politeness policy that states how to avoid overloading websites.<br />A parallelization policy that states how to coordinate distributed web crawlers.<br /><br /><strong>Selection policy</strong><br /><br />Given the current size of the Web, even large search engines cover only a portion of the publicly available content; a study by Lawrence and Giles (Lawrence and Giles, 2000) showed that no search engine indexes more than 16% of the Web. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages, and not just a random sample of the Web. This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Website). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling. Cho et al. (Cho et al., 1998) made the first study on policies for crawling scheduling. 
Their data set was a 180,000-page crawl from the stanford.edu domain, on which a crawling simulation was run with different strategies. The ordering metrics tested were breadth-first, backlink-count and partial Pagerank calculations. One of the conclusions was that if the crawler wants to download pages with high Pagerank early during the crawling process, then the partial Pagerank strategy is the best, followed by breadth-first and backlink-count. However, these results are for just a single domain. Najork and Wiener (Najork and Wiener, 2001) performed an actual crawl on 328 million pages, using breadth-first ordering. They found that a breadth-first crawl captures pages with high Pagerank early in the crawl (but they did not compare this strategy against other strategies). The explanation given by the authors for this result is that &quot;the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates&quot;. Abiteboul et al. (Abiteboul et al., 2003) designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of &quot;cash&quot; which is distributed equally among the pages it points to. It is similar to a Pagerank computation, but it is faster and is only done in one step. An OPIC-driven crawler downloads first the pages in the crawling frontier with higher amounts of &quot;cash&quot;. Experiments were carried out on a 100,000-page synthetic graph with a power-law distribution of in-links. However, there was no comparison with other strategies nor experiments in the real Web. Boldi et al. (Boldi et al., 2004) used simulation on subsets of the Web of 40 million pages from the .it domain and 100 million pages from the WebBase crawl, testing breadth-first against random ordering and an omniscient strategy. 
The winning strategy was breadth-first, although a random ordering also performed surprisingly well. One problem is that the WebBase crawl is biased to the crawler used to gather the data. They also showed how poorly Pagerank calculations carried out on partial subgraphs of the Web, obtained during crawling, approximate the actual Pagerank. Baeza-Yates et al. (Baeza-Yates et al., 2005) used simulation on two subsets of the Web of 3 million pages from the .gr and .cl domains, testing several crawling strategies. They showed that the OPIC strategy and a strategy that uses the length of the per-site queues are both better than breadth-first crawling, and that it is also very effective to use a previous crawl, when it is available, to guide the current one.<br /><br /><strong>Restricting followed links</strong><br /><br />A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to request only HTML resources, a crawler may make an HTTP HEAD request to determine a web resource's MIME type before requesting the entire resource with a GET request. To avoid making numerous HEAD requests, a crawler may alternatively examine the URL and only request the resource if the URL ends with .html, .htm or a slash. This strategy may cause numerous HTML web resources to be unintentionally skipped. Some crawlers may also avoid requesting any resources that have a &quot;?&quot; in them (are dynamically produced) in order to avoid spider traps which may cause the crawler to download an infinite number of URLs from a website.<br /><br /><strong>Path-ascending crawling</strong><br /><br />Some crawlers intend to download as many resources as possible from a particular web site. Cothey (Cothey, 2004) introduced a path-ascending crawler that would ascend to every path in each URL that it intends to crawl. For example, when given a seed URL of <a href="http://foo.org/a/b/page.html" target="_blank">http://foo.org/a/b/page.html</a>, it will attempt to crawl /a/b/, /a/, and /. 
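<br /><br />The path-ascending step in the foo.org example above can be sketched with urllib.parse from the Python standard library. The function name ascending_paths is made up for illustration, and the sketch assumes that a final path segment without a trailing slash names a document rather than a directory.

```python
from urllib.parse import urlsplit, urlunsplit


def ascending_paths(url):
    """List the directory URLs from the seed's own directory up to the site root."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    # Assumption: a last segment without a trailing slash is a document name.
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]
    results = []
    while True:
        path = "/" + "/".join(segments) + ("/" if segments else "")
        results.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
        if not segments:
            break
        segments = segments[:-1]
    return results


for candidate in ascending_paths("http://foo.org/a/b/page.html"):
    print(candidate)
# http://foo.org/a/b/
# http://foo.org/a/
# http://foo.org/
```

Each returned URL would simply be added to the crawl frontier alongside the seed itself.<br /><br />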
Cothey found that a path-ascending crawler was very effective in finding isolated resources, or resources for which no inbound link would have been found in regular crawling.<br /><br /><strong>Focused crawling</strong><br /><br />The importance of a page for a crawler can also be expressed as a function of the similarity of a page to a given query. Web crawlers that attempt to download pages that are similar to each other are called focused crawlers or topical crawlers. Focused crawling was first introduced by Chakrabarti et al. (Chakrabarti et al., 1999). The main problem in focused crawling is that in the context of a web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton (Pinkerton, 1994) in a crawler developed in the early days of the Web. Diligenti et al. (Diligenti et al., 2000) propose to use the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet. The performance of a focused crawler depends mostly on the richness of links in the specific topic being searched, and a focused crawler usually relies on a general Web search engine for providing starting points.<br /><br /><strong>Crawling the Deep Web</strong><br /><br />A vast number of web pages lie in the deep or invisible web. These pages are typically only accessible by submitting queries to a database, and regular crawlers are unable to find these pages if there are no links that point to them. Google's Sitemap Protocol and mod_oai (Nelson et al., 2005) are intended to allow discovery of these deep-web resources.<br /><br /><strong>Re-visit policy</strong><br /><br />The Web has a very dynamic nature, and crawling a fraction of the Web can take a long time, usually measured in weeks or months. By the time a web crawler has finished its crawl, many events could have happened. 
These events can include creations, updates and deletions. From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most used cost functions, introduced in (Cho and Garcia-Molina, 2000), are freshness and age.<br /><br />Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as 1 if the local copy of p is equal to the live page at time t, and 0 otherwise.<br /><br />Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as 0 if p has not been modified since the local copy was taken, and otherwise as t minus the modification time of p.<br /><br />[Figure: Evolution of freshness and age in Web crawling]<br /><br />Coffman et al. (Edward G. Coffman, 1998) worked with a definition of the objective of a web crawler that is equivalent to freshness, but used a different wording: they propose that a crawler must minimize the fraction of time pages remain outdated. They also noted that the problem of web crawling can be modeled as a multiple-queue, single-server polling system, in which the web crawler is the server and the websites are the queues. Page modifications are the arrival of the customers, and switch-over times are the interval between page accesses to a single website. Under this model, mean waiting time for a customer in the polling system is equivalent to the average age for the web crawler. The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is just concerned with how many pages are out-dated, while in the second case, the crawler is concerned with how old the local copies of pages are. Two simple re-visiting policies were studied by Cho and Garcia-Molina (Cho and Garcia-Molina, 2003):<br />Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change. 
Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency. (In both cases, the repeated crawling order of pages can be done either at random or with a fixed order.) Cho and Garcia-Molina proved the surprising result that, in terms of average freshness, the uniform policy outperforms the proportional policy in both a simulated Web and a real Web crawl. The explanation for this result comes from the fact that, when a page changes too often, the crawler will waste time by trying to re-crawl it too fast and still will not be able to keep its copy of the page fresh. To improve freshness, we should penalize the elements that change too often (Cho and Garcia-Molina, 2003a). The optimal re-visiting policy is neither the uniform policy nor the proportional policy. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal method for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page. In both cases, the optimal is closer to the uniform policy than to the proportional policy: as Coffman et al. (Edward G. Coffman, 1998) note, &quot;in order to minimize the expected obsolescence time, the accesses to any particular page should be kept as evenly spaced as possible&quot;. Explicit formulas for the re-visit policy are not attainable in general, but they are obtained numerically, as they depend on the distribution of page changes. (Cho and Garcia-Molina, 2003a) show that the exponential distribution is a good fit for describing page changes, while (Ipeirotis et al., 2005) show how to use statistical tools to discover parameters that affect this distribution. 
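<br /><br />The two cost functions used throughout this section, freshness and age, can be made concrete with a small Python sketch. The function names and the sample timestamps are made up for illustration. The sketch also assumes one particular reading of age: the time elapsed since the first change that the local copy has missed.

```python
def first_missed_change(copied_at, change_times, now):
    """Earliest change made after the copy was taken and no later than now, or None."""
    missed = [t for t in change_times if copied_at < t <= now]
    return min(missed) if missed else None


def freshness(copied_at, change_times, now):
    """Binary freshness: 1 while the local copy still matches the live page, else 0."""
    return 1 if first_missed_change(copied_at, change_times, now) is None else 0


def age(copied_at, change_times, now):
    """Age: 0 while the copy is fresh, otherwise time since the first missed change."""
    missed = first_missed_change(copied_at, change_times, now)
    return 0 if missed is None else now - missed


def average(metric, pages, now):
    """Average a per-page metric over (copied_at, change_times) pairs."""
    return sum(metric(copied_at, changes, now) for copied_at, changes in pages) / len(pages)


# Hypothetical collection of two pages, with times in arbitrary units.
pages = [
    (10, [4, 12, 15]),  # copied at t=10, changed again at t=12 and t=15
    (18, [3, 9]),       # copied at t=18, no change since
]
print(average(freshness, pages, now=20))  # 0.5
print(average(age, pages, now=20))        # 4.0
```

A re-visit policy can then be compared with another by simulating page changes and tracking these two averages over time.<br /><br />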
Note that the re-visiting policies considered here regard all pages as homogeneous in terms of quality (&quot;all pages on the Web are worth the same&quot;), something that is not a realistic scenario, so further information about the Web page quality should be included to achieve a better crawling policy.<br /><br /><strong>Politeness policy</strong><br /><br />As noted by Koster (Koster, 1995), the use of Web crawlers is useful for a number of tasks, but comes with a price for the general community. The costs of using Web crawlers include:<br />Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time.<br />Server overload, especially if the frequency of accesses to a given server is too high.<br />Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.<br />Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.<br /><br />A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol (Koster, 1996), which is a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. A non-standard robots.txt file may use a &quot;Crawl-delay:&quot; parameter to indicate the number of seconds to delay between requests, and some commercial search engines like MSN and Yahoo will adhere to this interval. The first proposal for the interval between connections was given in (Koster, 1993) and was 60 seconds. However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than 2 months to download only that entire website; also, only a fraction of the resources from that Web server would be used. 
This does not seem acceptable. Cho (Cho and Garcia-Molina, 2003) uses 10 seconds as an interval for accesses, and the WIRE crawler (Baeza-Yates and Castillo, 2002) uses 15 seconds as the default. The Mercator Web crawler (Heydon and Najork, 1999) follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits for 10t seconds before downloading the next page. Dill et al. (Dill et al., 2002) use 1 second. Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3&ndash;4 minutes. It is worth noticing that even when being very polite, and taking all the safeguards to avoid overloading Web servers, some complaints from Web server administrators are received. Brin and Page note that: &quot;... running a crawler which connects to more than half a million servers (...) generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen.&quot; (Brin and Page, 1998).<br /><br /><strong>Parallelization policy</strong><br /><br />A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes. Cho and Garcia-Molina (Cho and Garcia-Molina, 2002) studied two types of policies:<br />Dynamic assignment: With this type of policy, a central server assigns new URLs to different crawlers dynamically. This allows the central server to, for instance, dynamically balance the load of each crawler. 
With dynamic assignment, typically the systems can also add or remove downloader processes. The central server may become the bottleneck, so most of the workload must be transferred to the distributed crawling processes for large crawls. There are two configurations of crawling architectures with dynamic assignments that have been described by Shkapenyuk and Suel (Shkapenyuk and Suel, 2002): A small crawler configuration, in which there is a central DNS resolver and central queues per website, and distributed downloaders. A large crawler configuration, in which the DNS resolver and the queues are also distributed. Static assignment: With this type of policy, there is a fixed rule stated from the beginning of the crawl that defines how to assign new URLs to the crawlers. For static assignment, a hashing function can be used to transform URLs (or, even better, complete website names) into a number that corresponds to the index of the corresponding crawling process. As there are external links that will go from a website assigned to one crawling process to a website assigned to a different crawling process, some exchange of URLs must occur. To reduce the overhead due to the exchange of URLs between crawling processes, the exchange should be done in batch, several URLs at a time, and the most cited URLs in the collection should be known by all crawling processes before the crawl (e.g.: using data from a previous crawl) (Cho and Garcia-Molina, 2002). An effective assignment function must have three main properties: each crawling process should get approximately the same number of hosts (balancing property), if the number of crawling processes grows, the number of hosts assigned to each process must shrink (contra-variance property), and the assignment must be able to add and remove crawling processes dynamically. Boldi et al. 
(Boldi et al., 2004) propose using consistent hashing, which replicates the buckets so that adding or removing a bucket does not require re-hashing the whole table, to achieve all of the desired properties. Crawling is an effective process synchronisation tool between the users and the search engine.<br /><br /><strong>Web crawler architectures</strong><br /><br />[Figure: High-level architecture of a standard web crawler]<br /><br />A crawler must have a good crawling strategy, as noted in the previous sections, but it also needs a highly optimized architecture. Shkapenyuk and Suel (Shkapenyuk and Suel, 2002) noted that: &quot;While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability.&quot; Web crawlers are a central part of search engines, and details of their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about &quot;search engine spamming&quot;, which prevent major search engines from publishing their ranking algorithms.<br /><br /><strong>URL normalization</strong><br /><br />Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. The term URL normalization, also called URL canonicalization, refers to the process of modifying and standardizing a URL in a consistent manner. Several types of normalization may be performed, including conversion of URLs to lowercase, removal of &quot;.&quot; and &quot;..&quot; segments, and adding trailing slashes to the non-empty path component (Pant, 2004).<br /><br /><strong>Crawler identification</strong><br /><br />Web crawlers typically identify themselves to a web server by using the User-agent field of an HTTP request. 
Website administrators typically examine their web servers&rsquo; logs and use the user agent field to determine which crawlers have visited the web server and how often. The user agent field may include a URL where the website administrator can find more information about the crawler. Spambots and other malicious web crawlers are unlikely to place identifying information in the user agent field, or they may mask their identity as a browser or another well-known crawler. It is important for web crawlers to identify themselves so that website administrators can contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap, or they may be overloading a web server with requests, and the owner needs to stop the crawler. Identification is also useful for administrators who are interested in knowing when they may expect their web pages to be indexed by a particular search engine.<br /><br /><strong>Examples of web crawlers</strong><br /><br />The following is a list of published crawler architectures for general-purpose crawlers (excluding focused web crawlers), with a brief description that includes the names given to the different components and outstanding features:<br /><br />RBSE (Eichmann, 1994) was the first published web crawler. It was based on two programs: the first program, &quot;spider&quot;, maintains a queue in a relational database, and the second program, &quot;mite&quot;, is a modified www ASCII browser that downloads the pages from the Web.<br />WebCrawler (Pinkerton, 1994) was used to build the first publicly available full-text index of a subset of the Web. It was based on lib-WWW to download pages, and another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text with the provided query.<br />World Wide Web Worm (McBryan, 1994) was a crawler used to build a simple index of document titles and URLs. 
The index could be searched by using the grep Unix command.<br />Google Crawler (Brin and Page, 1998) is described in some detail, but the reference is only about an early version of its architecture, which was written in C++ and Python. The crawler was integrated with the indexing process, because text parsing was done for full-text indexing and also for URL extraction. A URL server sends lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to a URL server that checked whether the URL had been previously seen. If not, the URL was added to the queue of the URL server.<br />CobWeb (da Silva et al., 1999) uses a central &quot;scheduler&quot; and a series of distributed &quot;collectors&quot;. The collectors parse the downloaded Web pages and send the discovered URLs to the scheduler, which in turn assigns them to the collectors. The scheduler enforces a breadth-first search order with a politeness policy to avoid overloading Web servers. The crawler is written in Perl.<br />Mercator (Heydon and Najork, 1999) is a modular web crawler written in Java. Its modularity arises from the use of interchangeable &quot;protocol modules&quot; and &quot;processing modules&quot;. Protocol modules are related to how to acquire the Web pages (e.g.: by HTTP), and processing modules are related to how to process Web pages. The standard processing module just parses the pages and extracts new URLs, but other processing modules can be used to index the text of the pages, or to gather statistics from the Web.<br />WebFountain (Edwards et al., 2001) is a distributed, modular crawler similar to Mercator but written in C++. It features a &quot;controller&quot; machine that coordinates a series of &quot;ant&quot; machines. After repeatedly downloading pages, a change rate is inferred for each page and a non-linear programming method must be used to solve the equation system for maximizing freshness. 
The authors recommend using this crawling order in the early stages of the crawl, and then switching to a uniform crawling order, in which all pages are visited with the same frequency.<br />PolyBot (Shkapenyuk and Suel, 2002) is a distributed crawler written in C++ and Python, which is composed of a &quot;crawl manager&quot;, one or more &quot;downloaders&quot; and one or more &quot;DNS resolvers&quot;. Collected URLs are added to a queue on disk, and processed later to search for seen URLs in batch mode. The politeness policy considers both third and second level domains (e.g.: <a href="http://www.example.com/" target="_blank">http://www.example.com/</a> and www2.example.com are third level domains) because third level domains are usually hosted by the same Web server.<br />WebRACE (Zeinalipour-Yazti and Dikaiakos, 2002) is a crawling and caching module implemented in Java, and used as a part of a more generic system called eRACE. The system receives requests from users for downloading Web pages, so the crawler acts in part as a smart proxy server. The system also handles requests for &quot;subscriptions&quot; to Web pages that must be monitored: when the pages change, they must be downloaded by the crawler and the subscriber must be notified. The most outstanding feature of WebRACE is that, while most crawlers start with a set of &quot;seed&quot; URLs, WebRACE continuously receives new starting URLs to crawl.<br />Ubicrawler (Boldi et al., 2004) is a distributed crawler written in Java, and it has no central process. It is composed of a number of identical &quot;agents&quot;, and the assignment function is calculated using consistent hashing of the host names. There is zero overlap, meaning that no page is crawled twice, unless a crawling agent crashes (then, another agent must re-crawl the pages from the failing agent). The crawler is designed to achieve high scalability and to be tolerant of failures. 
FAST Crawler (Risvik and Michelsen, 2002) is the crawler used by the FAST search engine, and a general description of its architecture is available. It is a distributed architecture in which each machine holds a &quot;document scheduler&quot; that maintains a queue of documents to be downloaded by a &quot;document processor&quot; that stores them in a local storage subsystem. Each crawler communicates with the other crawlers via a &quot;distributor&quot; module that exchanges hyperlink information.<br /><br />In addition to the specific crawler architectures listed above, there are general crawler architectures published by Cho (Cho and Garcia-Molina, 2002) and Chakrabarti (Chakrabarti, 2003).<br /><br /><strong>Open-source crawlers</strong><br /><br />DataparkSearch is a crawler and search engine released under a GPL license.<br />GNU Wget is a command-line operated crawler written in C and released under the GPL. It is typically used to mirror web and FTP sites.<br />GRUB (acquired by Looksmart, no longer operational) was a distributed crawling project using an open architecture.<br />Heritrix, the Internet Archive Crawler (Burner, 1997), is a crawler designed for archiving periodic snapshots of a large portion of the Web. It uses several processes in a distributed fashion, and a fixed number of websites are assigned to each process. The inter-process exchange of URLs is carried out in batches with a long time interval between exchanges, as this is a costly process. The Internet Archive Crawler also has to deal with the problem of changing DNS records, so it keeps a historical archive of the hostname to IP mappings.<br />ht://Dig includes a Web crawler in its indexing engine.<br />HTTrack uses a web crawler to create a mirror of a website for off-line viewing. It is written in C and released under the GPL.<br />Larbin by Andreas Beder<br />Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text indexing package. 
WebBase is a crawler used by the Stanford WebBase Project.<br />WebSPHINX (Miller and Bharat, 1998) is composed of a Java class library that implements multi-threaded Web page retrieval and HTML parsing, and a graphical user interface to set the starting URLs, to extract the downloaded data and to implement a basic text-based search engine.<br />WIRE (Baeza-Yates and Castillo, 2002) is a web crawler written in C++ and released under the GPL. It includes several policies for scheduling the page downloads and a module for generating reports and statistics on the downloaded pages, so it has been used for Web characterization.<br /><br /><strong>References</strong><br /><br />Abiteboul, S., Preda, M., and Cobena, G. (2003). &quot;Adaptive on-line page importance computation&quot;. In Proceedings of the twelfth international conference on World Wide Web: 280-290. Baeza-Yates, R. and Castillo, C. (2002). Balancing volume, quality and freshness in web crawling. In Soft Computing Systems &ndash; Design, Management and Applications, pages 565&ndash;572, Santiago, Chile. IOS Press Amsterdam. Baeza-Yates, R., Castillo, C., Marin, M. and Rodriguez, A. (2005). Crawling a Country: Better Strategies than Breadth-First for Web Page ordering. In Proceedings of the Industrial and Practical Experience track of the 14th conference on World Wide Web, pages 864&ndash;872, Chiba, Japan. ACM Press. Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004a). UbiCrawler: a scalable fully distributed Web crawler. Software, Practice and Experience, 34(8):711&ndash;726. Boldi, P., Santini, M., and Vigna, S. (2004b). Do your worst to make the best: Paradoxical effects in pagerank incremental computations. In Proceedings of the third Workshop on Web Graphs (WAW), volume 3243 of Lecture Notes in Computer Science, pages 168-180, Rome, Italy. Springer. Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. 
Computer Networks and ISDN Systems, 30(1-7):107&ndash;117. Burner, M. (1997). Crawling towards eternity &ndash; building an archive of the World Wide Web. Web Techniques, 2(5). Castillo, C. (2004). Effective Web Crawling. PhD thesis, University of Chile. Chakrabarti, S. (2003). Mining the Web. Morgan Kaufmann Publishers. ISBN 1558607544 Chakrabarti, S., van den Berg, M., and Dom, B. (1999). Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks, 31(11&ndash;16):1623&ndash;1640. Cho, J., Garcia-Molina, H., and Page, L. (1998). &quot;Efficient crawling through URL ordering&quot;. In Proceedings of the seventh conference on World Wide Web. Cho, J. and Garcia-Molina, H. (2000). Synchronizing a database to improve freshness. In Proceedings of ACM International Conference on Management of Data (SIGMOD), pages 117-128, Dallas, Texas, USA. Cho, J. and Garcia-Molina, H. (2002). Parallel crawlers. In Proceedings of the eleventh international conference on World Wide Web, pages 124&ndash;135, Honolulu, Hawaii, USA. ACM Press. Cho, J. and Garcia-Molina, H. (2003). Effective page refresh policies for web crawlers. ACM Transactions on Database Systems, 28(4). Cho, J. and Garcia-Molina, H. (2003). Estimating frequency of change. ACM Transactions on Internet Technology, 3(3). Cothey, V. (2004). &quot;Web-crawling reliability&quot;. Journal of the American Society for Information Science and Technology 55 (14). Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). Focused crawling using context graphs. In Proceedings of 26th International Conference on Very Large Databases (VLDB), pages 527-534, Cairo, Egypt. Dill, S., Kumar, R., Mccurley, K. S., Rajagopalan, S., Sivakumar, D., and Tomkins, A. (2002). Self-similarity in the web. ACM Trans. Inter. Tech., 2(3):205&ndash;223. Eichmann, D. (1994). The RBSE spider: balancing effective search against Web load. 
In Proceedings of the First World Wide Web Conference, Geneva, Switzerland. Edward G. Coffman, Z. Liu, R. W. (1998). Optimal robot scheduling for Web search engines. Journal of Scheduling, 1(1):15&ndash;29. Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). &quot;An adaptive model for optimizing performance of an incremental web crawler&quot;. In Proceedings of the Tenth Conference on World Wide Web: 106-113. Heydon, A. and Najork, M. (1999). Mercator: A scalable, extensible Web crawler. World Wide Web Conference, 2(4):219&ndash;229. Ipeirotis, P., Ntoulas, A., Cho, J., Gravano, L. (2005) Modeling and managing content changes in text databases. In Proceedings of the 21st IEEE International Conference on Data Engineering, pages 606-617, April 2005, Tokyo. Kobayashi, M. and Takeda, K. (2000). &quot;Information retrieval on the web&quot;. ACM Computing Surveys 32 (2): 144-173. Koster, M. (1993). Guidelines for robots writers. Koster, M. (1995). Robots in the web: threat or treat ? ConneXions, 9(4). Koster, M. (1996). A standard for robot exclusion. Lawrence, S. and Giles, C. L. (2000). Accessibility of information on the web. Intelligence, 11(1), 32&ndash;39. McBryan, O. A. (1994). GENVL and WWWW: Tools for taming the web. In Proceedings of the First World Wide Web Conference, Geneva, Switzerland. Miller, R. and Bharat, K. (1998). Sphinx: A framework for creating personal, site-specific web crawlers. In Proceedings of the seventh conference on World Wide Web, Brisbane, Australia. Elsevier Science. Marc Najork and Janet L. Wiener. Breadth-first crawling yields high-quality pages. In Proceedings of the Tenth Conference on World Wide Web, pages 114&ndash;118, Hong Kong, May 2001. Elsevier Science. Nelson, M. L. , Van de Sompel, H. , Liu, X., Harrison, T. L. and McFarland, N. (2005). &quot;mod_oai: An Apache module for metadata harvesting&quot;. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005): 509. 
Pant, G., Srinivasan, P., Menczer, F. (2004). &quot;Crawling the Web&quot;. Web Dynamics: Adapting to Change in Content, Size, Topology and Use, edited by M. Levene and A. Poulovassilis, 153-178. Pinkerton, B. (1994). Finding what people want: Experiences with the WebCrawler. In Proceedings of the First World Wide Web Conference, Geneva, Switzerland. Risvik, K. M. and Michelsen, R. (2002). Search Engines and Web Dynamics. Computer Networks, vol. 39, pp. 289&ndash;302, June 2002. Shkapenyuk, V. and Suel, T. (2002). Design and implementation of a high performance distributed web crawler. In Proceedings of the 18th International Conference on Data Engineering (ICDE), pages 357-368, San Jose, California. IEEE CS Press. da Silva, A. S., Veloso, E. A., Golgher, P. B., Ribeiro-Neto, B. A., Laender, A. H. F., and Ziviani, N. (1999). Cobweb &ndash; a crawler for the Brazilian web. In Proceedings of String Processing and Information Retrieval (SPIRE), pages 184&ndash;191, Cancun, Mexico. IEEE CS Press. Zeinalipour-Yazti, D. and Dikaiakos, M. D. (2002). Design and implementation of a distributed crawler and filtering processor. In Proceedings of the Fifth Next Generation Information Technologies and Systems (NGITS), volume 2382 of Lecture Notes in Computer Science, pages 58&ndash;74, Caesarea, Israel. Springer. Retrieved from <a href="https://secure.wikimedia.org/wikipedia/en/wiki/Web_crawler">https://secure.wikimedia.org/wikipedia/en/wiki/Web_crawler</a><br /><br /><font size="3"><font color="#ff0000">Web crawler technical framework and analysis</font> The spider first starts from a given root set of web pages (Root) and, following some traversal strategy and obeying the robots.txt protocol, fetches the content of the relevant pages. It then stores each page on the local disk in some format, parses the hyperlinks in the page content, and stores the newly extracted addresses on the local disk for the next round of crawling.</font></div>
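The spider workflow summarized above (start from a root set, respect exclusion rules, store each page, extract its links, queue the new addresses) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the toy link graph `TOY_WEB`, the prefix list `DISALLOWED_PREFIXES` standing in for robots.txt rules, and the function names are all hypothetical, and no network access or real robots.txt parsing is performed.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
# A real spider would fetch pages over HTTP and parse the HTML instead.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://example.com/private/x": ["http://example.com/secret"],
}

# Hypothetical stand-in for robots.txt Disallow rules (URL prefixes).
DISALLOWED_PREFIXES = ["http://example.com/private/"]


def allowed(url):
    """Return True if no disallowed prefix matches the URL."""
    return not any(url.startswith(prefix) for prefix in DISALLOWED_PREFIXES)


def toy_fetch_links(url):
    """Look up outgoing links in the toy graph (no network access)."""
    return TOY_WEB.get(url, [])


def crawl(roots, fetch_links, max_pages=100):
    """Breadth-first crawl from the root set; returns URLs in visit order."""
    frontier = deque(roots)   # addresses waiting to be fetched
    seen = set(roots)         # addresses already queued or fetched
    visited = []              # stands in for "pages stored on local disk"
    while frontier and max_pages > len(visited):
        url = frontier.popleft()
        if not allowed(url):
            continue          # excluded pages are neither stored nor parsed
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


print(crawl(["http://example.com/"], toy_fetch_links))
```

Breadth-first order is used here only because it is the simplest traversal strategy; the frontier could equally be a priority queue ordered by any importance metric.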
</font></font> A web crawler (also known as a web spider or web robot) is a program which browses the World Wide Web in a methodical, automated manner. Other less frequently used names for web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and Takeda, 2000). Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam). A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. Crawling policies There are two important characteristics of the Web that generate a scenario in which web crawling is very difficult: its large volume and its rate of change, as there are a huge number of pages being added, changed and removed every day. Also, network speed has improved less than current processing speeds and storage capacities. The large volume implies that the crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted. As Edwards et al. 
note, &quot;Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in a not only scalable, but efficient way if some reasonable measure of quality or freshness is to be maintained.&quot; (Edwards et al., 2001). A crawler must carefully choose at each step which pages to visit next. The behavior of a web crawler is the outcome of a combination of policies: a selection policy that states which pages to download; a re-visit policy that states when to check for changes to the pages; a politeness policy that states how to avoid overloading websites; and a parallelization policy that states how to coordinate distributed web crawlers.<br /><br /><strong>Selection policy</strong><br /><br />Given the current size of the Web, even large search engines cover only a portion of the publicly available content; a study by Lawrence and Giles (Lawrence and Giles, 2000) showed that no search engine indexes more than 16% of the Web. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages, and not just a random sample of the Web. This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Website). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling. Cho et al. (Cho et al., 1998) conducted the first study of policies for crawl scheduling. Their data set was a 180,000-page crawl from the stanford.edu domain, on which a crawling simulation was run with different strategies. The ordering metrics tested were breadth-first, backlink-count and partial Pagerank calculations. 
One of the conclusions was that if the crawler wants to download pages with high Pagerank early during the crawling process, then the partial Pagerank strategy is the best, followed by breadth-first and backlink-count. However, these results are for just a single domain. Najork and Wiener (Najork and Wiener, 2001) performed an actual crawl of 328 million pages, using breadth-first ordering. They found that a breadth-first crawl captures pages with high Pagerank early in the crawl (but they did not compare this strategy against other strategies). The explanation given by the authors for this result is that &quot;the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates&quot;. Abiteboul et al. (Abiteboul et al., 2003) designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of &quot;cash&quot; which is distributed equally among the pages it points to. It is similar to a Pagerank computation, but it is faster and is only done in one step. An OPIC-driven crawler first downloads the pages in the crawling frontier with the highest amounts of &quot;cash&quot;. Experiments were carried out on a 100,000-page synthetic graph with a power-law distribution of in-links. However, there was no comparison with other strategies, nor were there experiments on the real Web. Boldi et al. (Boldi et al., 2004) used simulation on subsets of the Web of 40 million pages from the .it domain and 100 million pages from the WebBase crawl, testing breadth-first against random ordering and an omniscient strategy. The winning strategy was breadth-first, although a random ordering also performed surprisingly well. One problem is that the WebBase crawl is biased by the crawler used to gather the data. 
They also showed how poorly Pagerank calculations carried out on partial subgraphs of the Web, obtained during crawling, approximate the actual Pagerank. Baeza-Yates et al. (Baeza-Yates et al., 2005) used simulation on two subsets of the Web of 3 million pages from the .gr and .cl domains, testing several crawling strategies. They showed that both the OPIC strategy and a strategy that uses the length of the per-site queues are better than breadth-first crawling, and that it is also very effective to use a previous crawl, when it is available, to guide the current one.<br /><br /><strong>Restricting followed links</strong><br /><br />A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to request only HTML resources, a crawler may make an HTTP HEAD request to determine a web resource's MIME type before requesting the entire resource with a GET request. To avoid making numerous HEAD requests, a crawler may alternatively examine the URL and only request the resource if the URL ends with .html, .htm or a slash. This strategy may cause numerous HTML web resources to be unintentionally skipped. Some crawlers may also avoid requesting any resources that have a &quot;?&quot; in them (that is, dynamically produced resources) in order to avoid spider traps, which may cause the crawler to download an infinite number of URLs from a website.<br /><br /><strong>Path-ascending crawling</strong><br /><br />Some crawlers intend to download as many resources as possible from a particular web site. Cothey (Cothey, 2004) introduced a path-ascending crawler that would ascend to every path in each URL that it intends to crawl. For example, when given a seed URL of , it will attempt to crawl /a/b/, /a/, and /. Cothey found that a path-ascending crawler was very effective in finding isolated resources, or resources for which no inbound link would have been found in regular crawling.<br /><br /><strong>Focused crawling</strong><br /><br />The importance of a page for a crawler can also be expressed as a function of the similarity of a page to a given query. 
Web crawlers that attempt to download pages that are similar to each other are called focused crawlers or topical crawlers. Focused crawling was first introduced by Chakrabarti et al. (Chakrabarti et al., 1999). The main problem in focused crawling is that, in the context of a web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton (Pinkerton, 1994) in a crawler developed in the early days of the Web. Diligenti et al. (Diligenti et al., 2000) propose using the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet. The performance of focused crawling depends mostly on the richness of links in the specific topic being searched, and a focused crawler usually relies on a general Web search engine to provide starting points.<br /><br /><strong>Crawling the Deep Web</strong><br /><br />A vast number of web pages lie in the deep or invisible web. These pages are typically only accessible by submitting queries to a database, and regular crawlers are unable to find them if there are no links that point to them. Google&rsquo;s Sitemap Protocol and mod_oai (Nelson et al., 2005) are intended to allow discovery of these deep-web resources.<br /><br /><strong>Re-visit policy</strong><br /><br />The Web has a very dynamic nature, and crawling a fraction of the Web can take a long time, usually measured in weeks or months. By the time a web crawler has finished its crawl, many events could have happened. These events can include creations, updates and deletions. From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most commonly used cost functions, introduced in (Cho and Garcia-Molina, 2000), are freshness and age. 
Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as: $F_p(t) = 1$ if the local copy of $p$ is up to date at time $t$, and $F_p(t) = 0$ otherwise.<br /><br />Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as: $A_p(t) = 0$ if the local copy of $p$ is up to date at time $t$, and $A_p(t) = t - (\text{modification time of } p)$ otherwise.<br /><br />[Figure: Evolution of freshness and age in Web crawling]<br /><br />Coffman et al. (Edward G. Coffman, 1998) worked with a definition of the objective of a web crawler that is equivalent to freshness, but used different wording: they propose that a crawler must minimize the fraction of time pages remain outdated. They also noted that the problem of web crawling can be modeled as a multiple-queue, single-server polling system, in which the web crawler is the server and the websites are the queues. Page modifications are the arrival of the customers, and switch-over times are the interval between page accesses to a single website. Under this model, the mean waiting time for a customer in the polling system is equivalent to the average age for the web crawler. The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is just concerned with how many pages are outdated, while in the second case, the crawler is concerned with how old the local copies of pages are. Two simple re-visiting policies were studied by Cho and Garcia-Molina (Cho and Garcia-Molina, 2003):<br /><br />Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.<br />Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.<br /><br />(In both cases, the repeated crawling order of pages can be done either at random or with a fixed order.) 
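To make the two re-visiting policies concrete, the following sketch splits a fixed visit budget across pages under each policy. The function names, the `budget` parameter (total visits per unit time) and the example change rates are illustrative assumptions, not part of the cited work.

```python
def uniform_policy(change_rates, budget):
    """Give every page the same share of the visit budget."""
    share = budget / len(change_rates)
    return [share for _ in change_rates]


def proportional_policy(change_rates, budget):
    """Give each page a share proportional to its estimated change rate."""
    total = sum(change_rates)
    return [budget * rate / total for rate in change_rates]


# Hypothetical example: two pages changing 9 and 1 times per day,
# with a budget of 10 visits per day.
rates = [9.0, 1.0]
print(uniform_policy(rates, 10.0))       # [5.0, 5.0]
print(proportional_policy(rates, 10.0))  # [9.0, 1.0]
```

Both functions spend the same total budget; they differ only in how it is divided among pages.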
Cho and Garcia-Molina proved the surprising result that, in terms of average freshness, the uniform policy outperforms the proportional policy in both a simulated Web and a real Web crawl. The explanation for this result is that, when a page changes too often, the crawler wastes time trying to re-crawl it too fast and still cannot keep its copy of the page fresh. To improve freshness, we should penalize the elements that change too often (Cho and Garcia-Molina, 2003a). The optimal re-visiting policy is neither the uniform policy nor the proportional policy. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal method for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page. In both cases, the optimal policy is closer to the uniform policy than to the proportional policy: as Coffman et al. (Edward G. Coffman, 1998) note, &quot;in order to minimize the expected obsolescence time, the accesses to any particular page should be kept as evenly spaced as possible&quot;. Explicit formulas for the re-visit policy are not attainable in general, but they can be obtained numerically, as they depend on the distribution of page changes. Cho and Garcia-Molina (Cho and Garcia-Molina, 2003a) show that the exponential distribution is a good fit for describing page changes, while Ipeirotis et al. (Ipeirotis et al., 2005) show how to use statistical tools to discover parameters that affect this distribution. Note that the re-visiting policies considered here regard all pages as homogeneous in terms of quality (&quot;all pages on the Web are worth the same&quot;), which is not a realistic scenario, so further information about Web page quality should be included to achieve a better crawling policy. 
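The advantage of the uniform policy can be illustrated with a small calculation. The sketch below assumes that each page changes as a Poisson process (that is, with exponentially distributed intervals between changes) and is re-visited at evenly spaced intervals. Under those assumptions the probability that a copy is still fresh a time s after a visit is exp(-rate * s), and averaging this over one visit interval gives (1 - exp(-r)) / r, where r is the change rate divided by the visit rate. The function names and the example rates are hypothetical, and both rates must be positive.

```python
import math


def expected_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page under the Poisson change model.

    The page changes `change_rate` times per unit time on average and is
    re-visited at evenly spaced intervals, `visit_rate` times per unit time.
    """
    ratio = change_rate / visit_rate
    return (1.0 - math.exp(-ratio)) / ratio


def average_freshness(change_rates, visit_rates):
    """Mean expected freshness over a collection of pages."""
    values = [expected_freshness(c, v) for c, v in zip(change_rates, visit_rates)]
    return sum(values) / len(values)


change_rates = [9.0, 1.0]   # hypothetical changes per day for two pages
uniform = average_freshness(change_rates, [5.0, 5.0])       # 10 visits, split evenly
proportional = average_freshness(change_rates, [9.0, 1.0])  # 10 visits, split by rate
print(round(uniform, 3), round(proportional, 3))  # 0.685 0.632
```

In this toy case the proportional policy spends nine of its ten daily visits on the fast-changing page and still keeps it fresh only about 63% of the time, while the slow page loses most of the freshness it had under the uniform split.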
Politeness policy
As noted by Koster (Koster, 1995), the use of Web crawlers is useful for a number of tasks, but comes with a price for the general community. The costs of using Web crawlers include:
Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time.
Server overload, especially if the frequency of accesses to a given server is too high.
Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.
Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.
A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol (Koster, 1996), which is a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. A non-standard robots.txt file may use a &quot;Crawl-delay:&quot; parameter to indicate the number of seconds to delay between requests, and some commercial search engines like MSN and Yahoo will adhere to this interval. The first proposal for the interval between connections was given in (Koster, 1993) and was 60 seconds. However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than 2 months to download only that entire website; also, only a fraction of the resources from that Web server would be used. This does not seem acceptable. Cho (Cho and Garcia-Molina, 2003) uses 10 seconds as an interval for accesses, and the WIRE crawler (Baeza-Yates and Castillo, 2002) uses 15 seconds as the default. 
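The robots.txt handling and the download-time arithmetic above can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name `ExampleBot`, and the URLs are invented for illustration; the 15-second fallback simply reuses the WIRE default quoted above and is not a recommendation.

```python
import urllib.robotparser

# Hypothetical robots.txt content, including the non-standard Crawl-delay line.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "ExampleBot"    # hypothetical crawler name
DEFAULT_DELAY = 15.0    # fallback in seconds when the site states no Crawl-delay


def delay_for(agent: str) -> float:
    """Seconds to wait between requests to this site for the given crawler."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else DEFAULT_DELAY


allowed = parser.can_fetch(AGENT, "http://example.com/index.html")
blocked = parser.can_fetch(AGENT, "http://example.com/private/data.html")

# The arithmetic from the text: 100,000 pages at one request per 60 seconds.
days_at_60s = 100_000 * 60 / 86_400

print(allowed, blocked, delay_for(AGENT), round(days_at_60s, 1))
```

A crawler built this way would sleep for `delay_for(AGENT)` seconds between two requests to the same host and skip any URL for which `can_fetch` returns `False`.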
The Mercator Web crawler (Heydon and Najork, 1999) follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits for 10t seconds before downloading the next page. Dill et al. (Dill et al., 2002) use 1 second. Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3&ndash;4 minutes. It is worth noting that even when a crawler is very polite and takes every safeguard to avoid overloading Web servers, some complaints from Web server administrators are still received. Brin and Page note that: &quot;... running a crawler which connects to more than half a million servers (...) generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen.&quot; (Brin and Page, 1998).
Parallelization policy
A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes. Cho and Garcia-Molina (Cho and Garcia-Molina, 2002) studied two types of policies:
Dynamic assignment: With this type of policy, a central server assigns new URLs to different crawlers dynamically. This allows the central server to, for instance, dynamically balance the load of each crawler. With dynamic assignment, the system can typically also add or remove downloader processes. The central server may become the bottleneck, so most of the workload must be transferred to the distributed crawling processes for large crawls. 
There are two configurations of crawling architectures with dynamic assignment that have been described by Shkapenyuk and Suel (Shkapenyuk and Suel, 2002): a small crawler configuration, in which there is a central DNS resolver and central queues per website, and distributed downloaders; and a large crawler configuration, in which the DNS resolver and the queues are also distributed.
Static assignment: With this type of policy, there is a fixed rule stated from the beginning of the crawl that defines how to assign new URLs to the crawlers. For static assignment, a hashing function can be used to transform URLs (or, even better, complete website names) into a number that corresponds to the index of the corresponding crawling process. As there are external links that go from a website assigned to one crawling process to a website assigned to a different crawling process, some exchange of URLs must occur. To reduce the overhead due to the exchange of URLs between crawling processes, the exchange should be done in batches, several URLs at a time, and the most cited URLs in the collection should be known by all crawling processes before the crawl (e.g. using data from a previous crawl) (Cho and Garcia-Molina, 2002). An effective assignment function must have three main properties: each crawling process should get approximately the same number of hosts (balancing property); if the number of crawling processes grows, the number of hosts assigned to each process must shrink (contra-variance property); and the assignment must be able to add and remove crawling processes dynamically. To achieve all of the desired properties, Boldi et al. (Boldi et al., 2004) propose consistent hashing, which replicates the buckets so that adding or removing a bucket does not require re-hashing the whole table. Crawling is an effective process synchronisation tool between the users and the search engine. 
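Static assignment by hashing host names, and the consistent-hashing variant, can be sketched as follows. This is a simplified illustration under stated assumptions, not the assignment function of UbiCrawler or any other cited system: the process names `crawler-0` to `crawler-4`, the use of MD5 as the hash, and the choice of 100 replicas per process are all hypothetical.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _hash(text: str) -> int:
    """Map a string to a large integer; MD5 is used only as a stable hash, not for security."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")


def static_assign(url: str, n_processes: int) -> int:
    """Plain static assignment: hash the host name (not the full URL) modulo the process count."""
    host = urlsplit(url).hostname or ""
    return _hash(host) % n_processes


class ConsistentRing:
    """Consistent hashing: every process owns many replicated points on a ring."""

    def __init__(self, nodes, replicas=100):
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._keys = [point for point, _ in self._points]

    def node_for(self, host: str) -> str:
        """Return the owner of the first ring point after the host's hash, wrapping around."""
        i = bisect.bisect_right(self._keys, _hash(host)) % len(self._keys)
        return self._points[i][1]


hosts = [f"site{i}.example" for i in range(500)]
old_ring = ConsistentRing([f"crawler-{i}" for i in range(4)])
new_ring = ConsistentRing([f"crawler-{i}" for i in range(5)])

# Hosts whose owner changes when a fifth process joins.
moved = [h for h in hosts if old_ring.node_for(h) != new_ring.node_for(h)]
print("hosts reassigned after adding a process:", len(moved), "of", len(hosts))
```

Hashing the host rather than the whole URL keeps every page of a site on one process, which is what a per-site politeness policy needs. With the ring, adding a process only takes hosts away from existing processes and hands them to the new one; with the modulo rule, most hosts would typically change owner.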
Web crawler architectures
(Figure: High-level architecture of a standard web crawler.)
A crawler must have a good crawling strategy, as noted in the previous sections, but it also needs a highly optimized architecture. Shkapenyuk and Suel (Shkapenyuk and Suel, 2002) noted that: &quot;While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability.&quot; Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about &quot;search engine spamming&quot;, which prevent major search engines from publishing their ranking algorithms.
URL normalization
Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. The term URL normalization, also called URL canonicalization, refers to the process of modifying and standardizing a URL in a consistent manner. There are several types of normalization that may be performed, including conversion of URLs to lowercase, removal of &quot;.&quot; and &quot;..&quot; segments, and adding trailing slashes to the non-empty path component (Pant, 2004).
Crawler identification
Web crawlers typically identify themselves to a web server by using the User-agent field of an HTTP request. Website administrators typically examine their web servers&rsquo; logs and use the user agent field to determine which crawlers have visited the web server and how often. The user agent field may include a URL where the website administrator may find out more information about the crawler. 
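The three URL normalization steps listed above (lowercasing, removal of &quot;.&quot; and &quot;..&quot; segments, trailing slashes) can be sketched with the standard `urllib.parse` module. This is one possible reading of those steps, not a complete canonicalizer: the sketch lowercases only the scheme and host, because paths can be case-sensitive, and the rule for adding a trailing slash (only when the last path segment has no file extension) is a heuristic assumed here for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Drop '.' segments and resolve '..' segments without climbing above the root."""
    out = []
    for segment in path.split("/"):
        if segment == ".":
            continue
        if segment == "..":
            if len(out) > 1:  # keep the leading empty segment that stands for the root
                out.pop()
            continue
        out.append(segment)
    return "/".join(out)


def normalize(url: str) -> str:
    """Return a normalized form of url so that equivalent URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = remove_dot_segments(parts.path) or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment and "." not in last_segment:
        path += "/"  # heuristic: an extension-less last segment is treated as a directory
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


seen = set()
for candidate in ["HTTP://WWW.Example.com/a/./b/../c", "http://www.example.com/a/c/"]:
    seen.add(normalize(candidate))
print(seen)
```

Both candidate URLs collapse to the same normalized string, so a crawler that stores normalized URLs in its seen-set would fetch the resource only once.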
Spambots and other malicious web crawlers are unlikely to place identifying information in the user agent field, or they may mask their identity as a browser or other well-known crawler. It is important for web crawlers to identify themselves so that website administrators can contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap, or they may be overloading a web server with requests, and the owner needs to stop the crawler. Identification is also useful for administrators who are interested in knowing when they may expect their web pages to be indexed by a particular search engine.
Examples of web crawlers
The following is a list of published crawler architectures for general-purpose crawlers (excluding focused web crawlers), with a brief description that includes the names given to the different components and outstanding features:
RBSE (Eichmann, 1994) was the first published web crawler. It was based on two programs: the first program, &quot;spider&quot;, maintains a queue in a relational database, and the second program, &quot;mite&quot;, is a modified www ASCII browser that downloads the pages from the Web.
WebCrawler (Pinkerton, 1994) was used to build the first publicly available full-text index of a subset of the Web. It was based on lib-WWW to download pages, and another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text with the provided query.
World Wide Web Worm (McBryan, 1994) was a crawler used to build a simple index of document titles and URLs. The index could be searched by using the grep Unix command.
Google Crawler (Brin and Page, 1998) is described in some detail, but the reference is only about an early version of its architecture, which was based on C++ and Python. 
The crawler was integrated with the indexing process, because text parsing was done for full-text indexing and also for URL extraction. There is a URL server that sends lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to a URL server that checked whether the URL had been previously seen. If not, the URL was added to the queue of the URL server.
CobWeb (da Silva et al., 1999) uses a central &quot;scheduler&quot; and a series of distributed &quot;collectors&quot;. The collectors parse the downloaded Web pages and send the discovered URLs to the scheduler, which in turn assigns them to the collectors. The scheduler enforces a breadth-first search order with a politeness policy to avoid overloading Web servers. The crawler is written in Perl.
Mercator (Heydon and Najork, 1999) is a modular web crawler written in Java. Its modularity arises from the usage of interchangeable &quot;protocol modules&quot; and &quot;processing modules&quot;. Protocol modules are related to how to acquire the Web pages (e.g. by HTTP), and processing modules are related to how to process Web pages. The standard processing module just parses the pages and extracts new URLs, but other processing modules can be used to index the text of the pages, or to gather statistics from the Web.
WebFountain (Edwards et al., 2001) is a distributed, modular crawler similar to Mercator but written in C++. It features a &quot;controller&quot; machine that coordinates a series of &quot;ant&quot; machines. After repeatedly downloading pages, a change rate is inferred for each page and a non-linear programming method must be used to solve the equation system for maximizing freshness. The authors recommend using this crawling order in the early stages of the crawl, and then switching to a uniform crawling order, in which all pages are visited with the same frequency. 
PolyBot [Shkapenyuk and Suel, 2002] is a distributed crawler written in C++ and Python, which is composed of a &quot;crawl manager&quot;, one or more &quot;downloaders&quot; and one or more &quot;DNS resolvers&quot;. Collected URLs are added to a queue on disk, and processed later to search for seen URLs in batch mode. The politeness policy considers both third and second level domains (e.g.:  and www2.example.com are third level domains) because third level domains are usually hosted by the same Web server. WebRACE (Zeinalipour-Yazti and Dikaiakos, 2002) is a crawling and caching module implemented in Java, and used as a part of a more generic system called eRACE. The system receives requests from users for downloading Web pages, so the crawler acts in part as a smart proxy server. The system also handles requests for &quot;subscriptions&quot; to Web pages that must be monitored: when the pages change, they must be downloaded by the crawler and the subscriber must be notified. The most outstanding feature of WebRACE is that, while most crawlers start with a set of &quot;seed&quot; URLs, WebRACE is continuously receiving new starting URLs to crawl from. Ubicrawler (Boldi et al., 2004) is a distributed crawler written in Java, and it has no central process. It is composed of a number of identical &quot;agents&quot;; and the assignment function is calculated using consistent hashing of the host names. There is zero overlap, meaning that no page is crawled twice, unless a crawling agent crashes (then, another agent must re-crawl the pages from the failing agent). The crawler is designed to achieve high scalability and to be tolerant to failures. FAST Crawler (Risvik and Michelsen, 2002) is the crawler used by the FAST search engine, and a general description of its architecture is available. 
It is a distributed architecture in which each machine holds a &quot;document scheduler&quot; that maintains a queue of documents to be downloaded by a &quot;document processor&quot; that stores them in a local storage subsystem. Each crawler communicates with the other crawlers via a &quot;distributor&quot; module that exchanges hyperlink information.
In addition to the specific crawler architectures listed above, there are general crawler architectures published by Cho (Cho and Garcia-Molina, 2002) and Chakrabarti (Chakrabarti, 2003).
Open-source crawlers
DataparkSearch is a crawler and search engine released under a GPL license.
GNU Wget is a command-line operated crawler written in C and released under the GPL. It is typically used to mirror web and FTP sites.
GRUB (acquired by Looksmart, no longer operational) was a distributed crawling project using an open architecture.
Heritrix, the Internet Archive Crawler (Burner, 1997), is a crawler designed for archiving periodic snapshots of a large portion of the Web. It uses several processes in a distributed fashion, and a fixed number of websites are assigned to each process. The inter-process exchange of URLs is carried out in batches with a long time interval between exchanges, as this is a costly process. The Internet Archive Crawler also has to deal with the problem of changing DNS records, so it keeps a historical archive of the hostname-to-IP mappings.
ht://Dig includes a Web crawler in its indexing engine.
HTTrack uses a web crawler to create a mirror of a website for off-line viewing. It is written in C and released under the GPL.
Larbin by Andreas Beder.
Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text indexing package.
WebBase is a crawler used by the Stanford WebBase Project. 
WebSPHINX (Miller and Bharat, 1998) is composed of a Java class library that implements multi-threaded Web page retrieval and HTML parsing, and a graphical user interface to set the starting URLs, to extract the downloaded data and to implement a basic text-based search engine.
WIRE (Baeza-Yates and Castillo, 2002) is a web crawler written in C++ and released under the GPL. It includes several policies for scheduling the page downloads and a module for generating reports and statistics on the downloaded pages, so it has been used for Web characterization.
References
Abiteboul, S., Preda, M., and Cobena, G. (2003). &quot;Adaptive on-line page importance computation&quot;. In Proceedings of the twelfth international conference on World Wide Web: 280-290. Baeza-Yates, R. and Castillo, C. (2002). Balancing volume, quality and freshness in web crawling. In Soft Computing Systems &ndash; Design, Management and Applications, pages 565&ndash;572, Santiago, Chile. IOS Press Amsterdam. Baeza-Yates, R., Castillo, C., Marin, M. and Rodriguez, A. (2005). Crawling a Country: Better Strategies than Breadth-First for Web Page ordering. In Proceedings of the Industrial and Practical Experience track of the 14th conference on World Wide Web, pages 864&ndash;872, Chiba, Japan. ACM Press. Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004a). UbiCrawler: a scalable fully distributed Web crawler. Software, Practice and Experience, 34(8):711&ndash;726. Boldi, P., Santini, M., and Vigna, S. (2004b). Do your worst to make the best: Paradoxical effects in pagerank incremental computations. In Proceedings of the third Workshop on Web Graphs (WAW), volume 3243 of Lecture Notes in Computer Science, pages 168-180, Rome, Italy. Springer. Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107&ndash;117. 
Burner, M. (1997). Crawling towards eternity &ndash; building an archive of the World Wide Web. Web Techniques, 2(5). Castillo, C. (2004). Effective Web Crawling. PhD thesis, University of Chile. Chakrabarti, S. (2003). Mining the Web. Morgan Kaufmann Publishers. ISBN 1558607544 Chakrabarti, S., van den Berg, M., and Dom, B. (1999). Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks, 31(11&ndash;16):1623&ndash;1640. Cho, J., Garcia-Molina, H., and Page, L. (1998). &quot;Efficient crawling through URL ordering&quot;. In Proceedings of the seventh conference on World Wide Web. Cho, J. and Garcia-Molina, H. (2000). Synchronizing a database to improve freshness. In Proceedings of ACM International Conference on Management of Data (SIGMOD), pages 117-128, Dallas, Texas, USA. Cho, J. and Garcia-Molina, H. (2002). Parallel crawlers. In Proceedings of the eleventh international conference on World Wide Web, pages 124&ndash;135, Honolulu, Hawaii, USA. ACM Press. Cho, J. and Garcia-Molina, H. (2003). Effective page refresh policies for web crawlers. ACM Transactions on Database Systems, 28(4). Cho, J. and Garcia-Molina, H. (2003). Estimating frequency of change. ACM Transactions on Internet Technology, 3(3). Cothey, V. (2004). &quot;Web-crawling reliability&quot;. Journal of the American Society for Information Science and Technology 55 (14). Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). Focused crawling using context graphs. In Proceedings of 26th International Conference on Very Large Databases (VLDB), pages 527-534, Cairo, Egypt. Dill, S., Kumar, R., Mccurley, K. S., Rajagopalan, S., Sivakumar, D., and Tomkins, A. (2002). Self-similarity in the web. ACM Trans. Inter. Tech., 2(3):205&ndash;223. Eichmann, D. (1994). The RBSE spider: balancing effective search against Web load. In Proceedings of the First World Wide Web Conference, Geneva, Switzerland. Edward G. Coffman, Z. Liu, R. W. (1998). 
Optimal robot scheduling for Web search engines. Journal of Scheduling, 1(1):15&ndash;29. Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). &quot;An adaptive model for optimizing performance of an incremental web crawler&quot;. In Proceedings of the Tenth Conference on World Wide Web: 106-113. Heydon, A. and Najork, M. (1999). Mercator: A scalable, extensible Web crawler. World Wide Web Conference, 2(4):219&ndash;229. Ipeirotis, P., Ntoulas, A., Cho, J., Gravano, L. (2005) Modeling and managing content changes in text databases. In Proceedings of the 21st IEEE International Conference on Data Engineering, pages 606-617, April 2005, Tokyo. Kobayashi, M. and Takeda, K. (2000). &quot;Information retrieval on the web&quot;. ACM Computing Surveys 32 (2): 144-173. Koster, M. (1993). Guidelines for robots writers. Koster, M. (1995). Robots in the web: threat or treat ? ConneXions, 9(4). Koster, M. (1996). A standard for robot exclusion. Lawrence, S. and Giles, C. L. (2000). Accessibility of information on the web. Intelligence, 11(1), 32&ndash;39. McBryan, O. A. (1994). GENVL and WWWW: Tools for taming the web. In Proceedings of the First World Wide Web Conference, Geneva, Switzerland. Miller, R. and Bharat, K. (1998). Sphinx: A framework for creating personal, site-specific web crawlers. In Proceedings of the seventh conference on World Wide Web, Brisbane, Australia. Elsevier Science. Marc Najork and Janet L. Wiener. Breadth-first crawling yields high-quality pages. In Proceedings of the Tenth Conference on World Wide Web, pages 114&ndash;118, Hong Kong, May 2001. Elsevier Science. Nelson, M. L. , Van de Sompel, H. , Liu, X., Harrison, T. L. and McFarland, N. (2005). &quot;mod_oai: An Apache module for metadata harvesting&quot;. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005): 509. Pant, G., Srinivasan, P., Menczer, F. (2004). &quot;Crawling the Web&quot;. 
Web Dynamics: Adapting to Change in Content, Size, Topology and Use, edited by M. Levene and A. Poulovassilis, 153-178. Pinkerton, B. (1994). Finding what people want: Experiences with the WebCrawler. In Proceedings of the First World Wide Web Conference, Geneva, Switzerland. Risvik, K. M. and Michelsen, R. (2002). Search Engines and Web Dynamics. Computer Networks, vol. 39, pp. 289&ndash;302, June 2002. Shkapenyuk, V. and Suel, T. (2002). Design and implementation of a high performance distributed web crawler. In Proceedings of the 18th International Conference on Data Engineering (ICDE), pages 357-368, San Jose, California. IEEE CS Press. da Silva, A. S., Veloso, E. A., Golgher, P. B., Ribeiro-Neto, B. A., Laender, A. H. F., and Ziviani, N. (1999). Cobweb &ndash; a crawler for the Brazilian web. In Proceedings of String Processing and Information Retrieval (SPIRE), pages 184&ndash;191, Cancun, Mexico. IEEE CS Press. Zeinalipour-Yazti, D. and Dikaiakos, M. D. (2002). Design and implementation of a distributed crawler and filtering processor. In Proceedings of the Fifth Next Generation Information Technologies and Systems (NGITS), volume 2382 of Lecture Notes in Computer Science, pages 58&ndash;74, Caesarea, Israel. Springer.]]></description>
		</item>
		
			<item>
			<link>http://www.wenlei.net/default.asp?id=172</link>
			<title><![CDATA[How Sales Staff Should Approach Internet Marketing]]></title>
			<author>wenlei@vip.qq.com(闻雷)</author>
			<category><![CDATA[Internet Marketing Strategy]]></category>
			<pubDate>Mon,13 Aug 2007 01:22:41 +0800</pubDate>
			<guid>http://www.wenlei.net/default.asp?id=172</guid>	
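		<description><![CDATA[The internet marketing business (website building and website promotion) has become labor-intensive. Insurance, health supplements, and the internet are now the three industries in China whose marketing is most visibly labor-intensive.<br /><br />The Chinese internet has its own character: its leading products were all pushed into the market by front-line salespeople, the business representatives, and they have contributed directly to the whole industry. Front-line internet sales nevertheless faces many problems. Marketing has its own 80/20 rule: 80% of the product is sold by 20% of the sales staff. It is tempting to conclude that this 20% simply have the ability, but I disagree. Front-line sales is work that tempers the will, and attitude matters more.<br /><br />1. Rejection from customers who do not understand, and the psychological pressure it creates<br /><br />Patiently phoning customers every day, facing their lack of understanding and a 99% rejection rate, puts you under enormous pressure, and discouragement gradually sets in. This is when you need a healthy attitude to get yourself back on track.<br /><br />2. A long run without closing a deal, and the financial and psychological pressure it creates<br /><br />Front-line internet sales involves a long period of growth. You may need one month, two months, or even three to develop, and closing no deals during that time is quite normal. The key is to persist: I have found that salespeople who last more than six months all go on to produce impressive results. Early on, though, you have to manage the pressure of not closing deals yourself. Attitude matters more than ability. Remember that at the start you have no customer base and no network, so short-term difficulty is normal.<br /><br />3. Satisfaction with the status quo, and the complacency it creates<br /><br />I have always believed that attitude matters more than ability. Some salespeople do well when they first enter the business, become satisfied and proud, grow lazy over the following months, stand still, and are overtaken by those who came later. Never being satisfied is the ladder of human progress. In short, attitude matters more than ability. Brothers and sisters, keep at it! The future is in your hands!]]></description>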
		</item>
		
			<item>
			<link>http://www.wenlei.net/default.asp?id=126</link>
			<title><![CDATA[Search Engines Cannot Read Images: The Simpler the Web Design, the Better]]></title>
			<author>wenlei@vip.qq.com(闻雷)</author>
			<category><![CDATA[Internet Marketing Strategy]]></category>
			<pubDate>Sun,10 Jun 2007 19:16:57 +0800</pubDate>
			<guid>http://www.wenlei.net/default.asp?id=126</guid>	
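		<description><![CDATA[SEO basics: search engines cannot read images; the simpler the web design, the better
<p>On March 17, 2006, Chris Sherman, executive editor of Search Engine Watch, said at the Search Engine Strategies conference that many people design very flashy web pages and consider them innovative, forgetting that search engines cannot recognize images and other multimedia files. A web page should highlight its key points and otherwise be as simple as possible.</p>
<p>Chris Sherman reminded web designers that if they want search engines to recognize a page well, they should never use too much non-text content.</p>
<p>Chris Sherman described developments at the forefront of search, including cross-border markets and the high end of the market, and focused on search engine strategy. He covered both paid and natural rankings in some detail.</p>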
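<p>Let us look first at the free rankings. When you type in a keyword, different search engines may offer different services, but all of them are web-wide search. The results are listed in the left-hand column. They are affected by many factors, and the search engine decides which results are the most important and go first. These are the organic rankings. Whatever name is used, they work like PR, public relations: if your PR is done well, you can be ranked ahead. This is the topic I will cover shortly, search engine optimization. When you open a results page, each page is shown with a couple of lines of description. The page is indexed with all of its content, but you have to decide which content will be presented in depth and which will appear only as a page snapshot. I believe every search engine wants your content delivered to users as simply as possible. The search engine has to classify that content, and part of this is outside your control, so it takes a lot of effort and careful thought.</p>
<p>Now consider paid search, another important element. It goes by different names, such as sponsored listings or advertiser listings. The name does not matter; what matters is the kind of advertising you buy. Unlike editorially ranked search content, paid listings are ordered entirely by priority. The advantage is that you can choose your position in the sponsored listings, find out where the market really is, and learn how much a given position costs.</p>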
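<p>Take a message, for example. For the market as a whole, merely providing the message is not enough; you also have to control where it is placed. As with PR, all of these rankings are free; we call them the web's natural rankings. They make content work like a directory in which users can find what they are looking for. Once they have found it, you try to influence how they think about and understand your content, just as PR people work to influence public opinion and the desire to buy. So the job now is to do some PR for your own company, or to have a PR firm do it, but in the right way. You must make sure the information on your website is strong, that it describes your company's products and services, and that the site does not lead people to misunderstand them; otherwise, the PR work may damage your brand. Then and now, you have to understand what ranking can do: it is only a ranking, with no guarantee. If you pay, I can put you in the top ten, but we cannot guarantee whose product the user buys. In fact no company can.</p>
<p>On the free side, PR can get your information listed; even on the paid side, no one can guarantee that you will be placed at the very top. If you bring people to your site step by step, the search engines can represent you more specifically; otherwise you will gradually be dropped. Fundamentally, a search engine arranges the information people want so that users can sift through it. People keep asking what the secret is to reaching the top of Google and what work it takes. I think there is indeed some magic: people want to find your company, and that is where our ability and our work lie. The search engine will find this information automatically, and you will benefit from it. So do not rely entirely on tricks; they are of little use to you. What you need to work on is natural ranking and paid ranking. As with a PR firm, you have to convey the right message: how you convey it, how you build the page, while giving customers high-quality site content.</p>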
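<p>Put simply: first decide what your ten keywords are and how well they sum up what you do. We might want a thousand or ten thousand keywords if money and time allowed, but that is not realistic. Many companies, small ones especially, cannot manage a thousand keywords; they concentrate on ten or so that make things clear at a glance. It is a question of resources. A keyword need not be a single word. Instead of &ldquo;shoes&rdquo;, use two words, such as &ldquo;running shoes&rdquo;: describing the shoes in more detail increases relevance, so that whichever term people click, they reach the relevant site and listing. Brainstorm which terms are most commonly used, and do some research before settling on keywords; do not assume that putting things on the web is enough. Without real keyword research, the money you spend on paid listings will not deliver its full value. Another important point is to make full use of your content. Get to know the search engines, then look at what each category and sub-category in them contains. When you start building a company site or considering paid listings, you may think that choosing keywords is all there is to it. It is not: some search engines also carry images and other content, and if you do not understand what services and content a search provider offers, you may run into problems when you use it. Some engines offer only text in certain categories, while others offer images. People always want their site classified under different directories and listed in the search engine, and that is a very good practice.</p>
<p>But if you try to attach all of your content to the search engine, there is simply too much of it, and getting each piece listed is not easy. Consider: there are billions, even tens of billions, of pages on the web, millions of sites, and many search categories and indexes, and everyone wants to be in the search engines. How do you deal with this? The first option is not to care and simply compete with everyone.</p>
<p>The second is to place your content in as broad a category as possible, so that people searching may pull it up. Zhou Shaoning (周韶宁) just mentioned a concept called TEL ?: make some of your content as detailed as possible to narrow the search scope, and add some words that few others use. In practice, 20% of words are the ones everyone uses. Use them, but do not put blind faith in that 20%. The remaining 80% are less common, and using them may bring results you did not expect.</p>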
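<p>Another point is crawlers and how they work. They are not the part of the search engine that users see, but they do essential work. The thing to watch is links: if your site has few links, it is of no use, because crawlers cannot reach it. Links are today's topic, since crawlers work by following links. Once a crawler finds your site, whether text files or web pages, your site becomes visible to everyone. What crawlers gather is not fully real-time. Billions of people run tens of billions of searches every day; if everything were fully real-time, the network would have serious problems and nothing would work. When you type in a query, the search engine returns what is in its own index; the results are not fetched live from each website. This technique is exactly what makes search fast.</p>
<p>Now back to ranking, which is genuinely hard to define. As I said, content is key. A search engine is like a computer, and the programs inside it are fairly complex. It works somewhat as we do when reading a book or a newspaper: it uses a form of machine intelligence, a method called &ldquo;text interpretation&rdquo;, reading content in context to retrieve it, and then ranks it so that we can view the results in a browser. I will return to this shortly.</p>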
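<p>Beyond that, there is the question of how you design the specific topics related to the content you put online, and the related directories. A single topic can have a great many links; in other words, a search engine's coverage is very broad. So treat these questions with great care: do not make your coverage too broad, and do not neglect any content either. The search engine also raises a requirement: how long does the content you put online remain valid? A restaurant may have put information on a website and since closed. So when designing content, get the relevant details right, so that the search engine can list as much as possible of what you need.</p>
<p>Another point: in any city in the world, if you want to search for a place, say a pleasant neighborhood, some additional detail makes the search more targeted. Conversely, when you design a page for a community, you should consider how search engines really search. How do you submit content to crawlers? If you really have many links, you need not spend much time on crawlers. Just put up your site map and some pages: because you already have many links, others can easily reach your pages through them. I am not saying this will push you straight to the top of the results, but people will certainly be able to find your site. You can also create more pages, which raises the chance of being linked. If the sub-pages under your main page are not all submitted to the crawlers, they may be hard to find in search.</p>
<p>Now back to the key point, content. Think about how we read, and how search engines search: they search every sub-page of your site. Have you thought about which thing on each page is most important? When you write each page, make the most important thing stand out so that search engines can read it easily. When you work on keywords or content, the keyword may be missing from a sub-page, and then the search engine cannot attach that page to it. If the page does not say &ldquo;running shoes&rdquo; and uses another term instead, the search engine may not list the page and your content will not appear. Make sure your keywords appear on your pages as much as possible. You can test this with your own site's search: type a keyword that any user might type. If nothing comes up on your own site, how could a search engine show you for that keyword? It cannot. Pay attention to your images as well. As I said, a search engine cannot index image files. A search engine is like a blind person: it cannot see these things and must have text. A few special search engines may index some images, but your product information will not be indexed that way.</p>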
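<p>The advice above, that a search engine behaves like a blind reader and can only attach a page to keywords that appear in its text, can be illustrated with a small Python sketch. It is only an approximation of how an indexer might read a page, built on the standard <code>html.parser</code> module; the class name, the helper <code>mentions</code>, and both sample pages are hypothetical.</p>

```python
from html.parser import HTMLParser


class TextOnlyView(HTMLParser):
    """Collects the words a text-only indexer could read; images contribute nothing."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.image_count = 0
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            self.image_count += 1  # counted, but no text is extracted from it

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.lower().split())


def mentions(html: str, phrase: str) -> bool:
    """True if the phrase appears in the text a text-only reader would see."""
    view = TextOnlyView()
    view.feed(html)
    view.close()
    return phrase.lower() in " ".join(view.words)


# Two hypothetical pages: one says "running shoes" in text, one only shows a picture.
text_page = "<h1>Trail running shoes</h1><p>Lightweight running shoes for rough ground.</p>"
image_page = '<img src="running-shoes-banner.jpg"><p>Our new collection.</p>'

print(mentions(text_page, "running shoes"))
print(mentions(image_page, "running shoes"))
```

<p>The second page shows running shoes to a human visitor, but nothing in its readable text says so, which is the situation the speaker warns about.</p>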
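<p>Now to the question of filling gaps. When designing, make the text as complete as possible. If you cannot write it well yourself, pay someone who can. Make it understandable: avoid heavy jargon and do not be pedantic. It is enough that people understand it and that it is strongly descriptive. Keep your HTML understandable too: no flashy tricks, nothing that is hard to follow. Let me stress that again. Put yourself in the searcher's place: what do you use most often when you run a text search? So avoid multimedia; things must be describable in words. For &ldquo;running shoes&rdquo;, you might add three or four modifiers, and each of them can be picked up by the search engine. Be as creative as you can, but creativity does not mean inventing all sorts of non-text content. When you do this gap-filling, add whatever might have been missed, so that when others use a search engine all of your content comes up.</p>
<p>This is the Title Tag, which one might call a secret weapon. The Title Tag is extremely important to search engines. It is like the title of a book: the search engine first takes in the largest block of content, and the Title Tag is easy for it to read. When designing content, things like the company name and product names are important, but the Title Tag must be unique; only when it is unique does it work really well.</p>
<p>Take &ldquo;running shoes&rdquo; again, for example Nike running shoes: you can use that as a Title Tag. People will be interested, and they will be able to find it by searching. </p>
<p>People can then click through and look at the information on your page. The title really is very important: it describes the page and can persuade people to click to see what the page offers. People sometimes do not realize how important the Title Tag is for the page as a whole. Next are Meta Tags, which carry information about the site and help search engines and others understand what the site is about. Yahoo does this too. When the search industry was in its infancy, search engines could not do much without Meta Tags. Another element is the description tag, which is also very important. Consider your purpose: the meta description tag says what the site is about and what is valuable in it. Some search engines treat the metadata as a search target and place it in the index. What you write is up to you, but never mislead: people reading search results may be misled and be very frustrated when they reach the article. Integrate your keywords first, which is very important, and keep them short and clear; overly long keywords do you harm and no good. Also, treat the first paragraph of the full text as a title, and put the relevant summary and text first on the page.</p>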
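<p>The points above about unique Title Tags and meta description tags can be checked mechanically. The following Python sketch, again based on the standard <code>html.parser</code> module, extracts both from a set of pages and reports duplicate titles and missing descriptions. The class name, the helper <code>head_info</code>, and the three sample pages are hypothetical and exist only for illustration.</p>

```python
from collections import Counter
from html.parser import HTMLParser


class HeadInfo(HTMLParser):
    """Extracts the <title> text and the meta description of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "description":
                self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def head_info(html: str):
    """Return (title, description) for one page, both stripped of whitespace."""
    parser = HeadInfo()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.description.strip()


# Hypothetical site: path -> page source.
pages = {
    "/": '<html><head><title>Example Shoes</title>'
         '<meta name="description" content="Running shoes and trail shoes."></head></html>',
    "/nike": '<html><head><title>Nike running shoes | Example Shoes</title>'
             '<meta name="description" content="Nike running shoes in all sizes."></head></html>',
    "/sale": "<html><head><title>Example Shoes</title></head></html>",
}

title_counts = Counter(head_info(html)[0] for html in pages.values())
duplicate_titles = sorted(title for title, n in title_counts.items() if n > 1)
missing_descriptions = sorted(path for path, html in pages.items() if not head_info(html)[1])

print("duplicate titles:", duplicate_titles)
print("pages without a description:", missing_descriptions)
```

<p>On this made-up site, the home page and the sale page share a title, and the sale page has no description, so both would be candidates for rewriting.</p>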
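<p>Look at the title and the meta description, including the metadata for running shoes. Below we show some results from natural, or organic, search. As you can see, this information can be put to use: while you are building a page, people are searching for this kind of information in exactly this way. With metadata you have control over this, but as a company we give no guarantee.</p>
<p>The description of the page is also very important. What we offer is only a starting point. You can treat it as an automated tool for solving the corresponding problems; sometimes there are newsletters, for example, and people who see the newsletter can act on it. Users decide for themselves which pages to visit and what to do.</p>
<p>What do you do if you do not want search engines to index your site, especially metadata content that you do not want passed on? We call this &ldquo;meta exclusion&rdquo;. There is no need to put such material online for others to search. The best way is, if you purchase computer TFIVE software [sic]. What matters is that this exclusion mechanism is entirely voluntary: when it is in place, we will not index your pages. Some search engines may not behave so well, however. If you have very sensitive or private information, do not put it on a public website; otherwise you may find information that is very sensitive to you made public through a search engine. On optimization strategy, some users browse a Google results page and then come back to look for the same information again, which is sometimes a complete waste of time and unnecessary. Let me mention design. Concentrating on search engine keywords alone may mislead you. Design affects more than your position in the rankings: on a specific site, you may waste a lot of time, and the design can affect how search engines handle you negatively, especially with animation-heavy pages. We should avoid these problems during development. It takes us back to the web design problems of the 20th century. A site backed by a database creates problems of its own. What I am saying is that if you commit to using these features, you must do so very carefully.</p>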
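<p>The &ldquo;meta exclusion&rdquo; described above appears to correspond to the widely used robots meta tag, in which a page asks compliant search engines not to index it; that reading is an assumption, since the transcript does not name the tag. Under that assumption, the check a well-behaved indexer might perform can be sketched in Python as follows. The class and function names are hypothetical.</p>

```python
from html.parser import HTMLParser


class RobotsMeta(HTMLParser):
    """Collects the directives of any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def may_index(html: str) -> bool:
    """True unless the page carries a robots meta tag containing 'noindex'."""
    parser = RobotsMeta()
    parser.feed(html)
    parser.close()
    return "noindex" not in parser.directives


private_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
public_page = "<html><head><title>Running shoes</title></head></html>"

print(may_index(private_page), may_index(public_page))
```

<p>As the speaker notes, the mechanism is voluntary: it only works for search engines that choose to honor it, so it is no substitute for keeping sensitive material off public pages.</p>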
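<p>Now consider how important links are. Link analysis was developed by Google around 1999 and became a major competitive factor in the search industry. Link analysis can be used to judge the relevant text, page snapshots, and so on. Today it amounts to using the opinion of the web to assess pages. We see more and more links pointing to relevant pages. This does pose some risk to your privacy: you may choose not to link to others, but you cannot stop others from linking to you. Over today and tomorrow, the sessions will look at how to use links, which links to buy, and so on.</p>
<p>The quality of links is also very important; sometimes quantity does not matter, and quality is what counts. A single link from a high-quality site is very valuable, whereas a thousand links from low-end, low-quality sites are worth less than one link from a high-quality site.</p>
<p>A link can convey your importance to other sectors. It is a very useful tool: it not only helps you get online but also makes you known to others, connecting you with colleagues and other potential contacts. Besides link quality, content matters too. In terms of content, there are two kinds of link: the link as content, and the actual site link, the one that takes you to a page when you click it.</p>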
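<p>ANCHOR TEXT is the text of a link that points to your page's address. Search engines use ANCHOR TEXT to establish relevance. As you can see, you click on it to reach the information. In the example on the left, you can see that the information is very specific. Links near the top of the page matter in particular. When you ask others to link to your page, identify the relevant keywords, reinforce them, put them in your content, and have other sites link to you using those keywords; this helps your site grow. How do you find important pages to link with, and how do you get others to link to you? There are techniques I could introduce, but the simplest method is to use a search engine. Go to a search engine and type in your keyword. The pages that come up are the highest-quality pages the engine has found so far, ideally the best sites, and these are the ones to seek links with. There is another reason as well: search engines are not the only route by which people reach you.</p>
<p>When people use the internet, search engines provide promotion. Remember that the sites at the top of the results are the ones people visit most easily. They can reach your page through a search engine and reach other pages through yours, which is a double promotion. There are also non-competitive techniques for gaining links, and they are increasingly common. Even competitors can offer each other links: this is not just a link but a form of cooperation, and it can be done in a clever way. How? Very directly. First link to their page, then send a request, &ldquo;Would you be willing to link to me?&rdquo;, and keep the link simple, including your own description and keywords. Ask them to link to your home page, and link your home page to theirs. Tell them why linking to you benefits them, for instance promotion through your content; give them a good reason. From a competitor's point of view the logic is the same, although they may start from a different position. For your part, never say that you are unwilling to link: if someone invites you to exchange links, politely say yes.</p>
<p>There are also three so-called golden goals. First, choose links with your target users in mind: do not chase links for their own sake, but link with those you want to reach. Sometimes others are willing to sell links on their sites, which is something of a diplomatic process. But when a link would benefit end users, do not hesitate: it is sometimes a matter of value, not of money. Why link at all? Links let competitors and users learn about you, and they let you give something back to visitors by offering additional information that meets their needs.</p>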
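<p>Now let us look at search engine advertising. I think there are three main aspects. If your public relations is poor, people are unlikely to advertise with you and your company is unlikely to succeed. The second is content: if the content is poor, you will not make money in the end. And if a company wants people to link to it, it must do public relations well, and advertising is also indispensable.</p>
<p>Another topic is paid listings. The more you pay, the higher you rank; Google is one example. The key point is that payment and ranking are proportional. There is another side to this, which is whether many people search for you: the more people search for you, the higher you rank. Note also that when people find your site through search, they do not want to see only your home page but other content as well, and this too is closely tied to payment; after arriving, they keep reading. If you are really going to spend money, I think you should pay Google and Yahoo and buy that service there, so that anyone searching for your company can readily find information about it. In China, Baidu is a very good search engine, and other companies are now good as well and worth considering.</p>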
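<p>To summarize, we have talked about what web search is really about and whether to pay or not. There is also the question of vertical business development and vertical search, which will be an important area over the next five years. Today I asked the other speakers for their views, and they believe business in this area may grow very quickly; people will probably pay more attention to information they can capture immediately. What is the situation worldwide? People are now working on B2B, shopping, and other fields, all of which can grow further through search engines in the coming years. In fact, search engines are not only a way to attract a lot of traffic; vertical development will also be a good way to save money.</p>
<p>Worldwide there are a great many websites, and every day many search engines help people find them. Globally, engines such as Google, Yahoo, and MSN are all very good, but in China Baidu is very good; this is the conclusion of my own research. We can look at which companies are strongest right now. These search engines also have relationships with one another, and some of their services overlap. Whether within one country or for a single site, you can build your own links. Google and the other search engines have set up partnership agreements; MSN, for example, currently uses Yahoo's links. Some of you may not have used the MSN engine yet; I suggest you try it, as it is also a good search engine.</p>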
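<p>What about Google? As you can see, many people around the world use Google, and it ranks very high. Another is AOL Search, which also has very broad global coverage and many users; you can try it if you like. If you run the same search on AOL and on Google, the results will sometimes differ, because Time Warner has placed a lot of its own content in the results. Another search engine is Ask. They now have a partnership with Google as well, and they have also acquired technology of their own. Ask is still growing, as you will see if you look at its recent business strategy, and it has already set up a research and development center in China, so it is worth trying.</p>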
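<p>Yahoo also has many users and offers other services as well; it does well not only in China but globally. It also has the paid service I mentioned. Paying does not necessarily mean you will rank near the top. With other engines, paying does not guarantee that your content becomes searchable right away, but Yahoo guarantees that if you pay, people who search will be able to find you, although your ranking will not necessarily be high. I think this paid service is rather like gambling or buying a lottery ticket: after you pay, your content may be included but not necessarily ranked near the top, and people find that hard to accept. Let me summarize. I have covered a lot; some of it will seem reasonable and some may be hard to accept, but it is all real. If you have questions, you can talk to us or to others in the industry. You may also have noticed another point: you should not treat ranking as the only measure. A search engine's relevance calculation changes constantly, so a high ranking today after paying does not mean a high ranking tomorrow. Search engines also make internal adjustments, to their technology as well as to relevance, so do not blindly assume that spending money guarantees a top position. Vertical business development over the next five years is also critical. We have two more speakers, so let me quickly swap my laptop for theirs and hand over.</p>]]></description>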
		</item>
		
			<item>
			<link>http://www.wenlei.net/default.asp?id=124</link>
			<title><![CDATA[Internet Marketing Strategy and Search Engine Optimization]]></title>
			<author>wenlei@vip.qq.com(闻雷)</author>
			<category><![CDATA[网络营销战略]]></category>
			<pubDate>Sun, 10 Jun 2007 19:06:06 +0800</pubDate>
			<guid>http://www.wenlei.net/default.asp?id=124</guid>	
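		<description><![CDATA[If you want to move your business to online sales, congratulations: you have already recognized that the internet is something you cannot ignore. Before you move your business online, there is one more thing you must understand: how much you achieve in this field depends on how well you understand internet marketing and how you present your products to customers. Unfortunately, many people go into online business without doing market research or making a plan.
<p>The goal of traditional marketing is to put the right product in the right place at the right price. The right promotional mix is the one that presents your product or service information to the right people (your target audience). Before you start executing, you should put these ideas in writing as an actionable marketing plan.</p>
<p>Internet marketing should revolve around building your brand. Whether your brand comes from your products or your services, you must build customer trust. Building a relationship with customers means establishing trust and commercial credibility between them and your website. Your brand is one of the biggest differences between you and your competitors. Look at your website and your brand from the customer's point of view: do they inspire trust? Remember that even a high ranking in search results does not give you any commercial credibility; it only lets customers find you faster.</p>
<p>The sole goal of marketing is to make sales. In internet marketing this is easy to overlook, because there is so much else to worry about: search rankings, site traffic, click-through rates, return rates, and so on.</p>
<p>The goal of search engine optimization is to make you easier to find in search results. That does not guarantee sales. Exposure alone does not produce sales; you also need a strong marketing effort to drive them. If it generates no sales revenue, ranking first in a search engine means nothing for your business.</p>
<p>Search engine optimization works alongside paid advertising, another part of your overall marketing strategy, in the service of your company's brand marketing. Organic search results can be thought of as public relations. Your business should focus on marketing, for example on gaining customers rather than visitors.</p>
<p>Search engine optimization on its own cannot sell your products or services. Optimization can increase targeted visitors, but targeted visitors alone do not guarantee sales; only your marketing can bring your business more sales and profit.</p>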
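<p>What can you do to help your online business?</p>
<p>Remember that a pricing strategy alone will not bring sales either. Value drives sales, and your brand determines your value. If you build a meaningful and distinctive brand, customers will remember it and come back to buy from you. So build your brand.</p>
<p>Differentiate yourself from competitors. Customers shop around before buying a product, and what makes one seller stand out is its product descriptions, customer service, customer experience, and the impression its website makes. If your website looks professional in its field, that helps strengthen your brand value.</p>
<p>Make sure your website focuses on customers and on improving their experience. Many websites focus only on the company and its own products. What customers see on a website is often very different from what the company thinks it is showing. We should avoid building a corporate website that reads like a product manual.</p>
<p>What should a website do to hold your customers' attention? It should load quickly, be easy to navigate, carry content relevant to your target customers, and be updated regularly. That content should help build trust and credibility. The site should also help you measure how people use it so that you can improve results (sales).</p>
<p>A website's success rests on a sound marketing strategy, not on search engine traffic. Search engine optimization should be one part of your overall marketing strategy, not your only marketing strategy. Our websites should be customer-friendly and not merely search-engine-friendly, and we should create distinctive brand value that sets us apart from competitors.</p>]]></description>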
		</item>
		
			<item>
			<link>http://www.wenlei.net/default.asp?id=123</link>
			<title><![CDATA[Bill Hunt on How to Become an Excellent SEM/SEO Practitioner]]></title>
			<author>wenlei@vip.qq.com(闻雷)</author>
			<category><![CDATA[网络营销战略]]></category>
			<pubDate>Sun, 10 Jun 2007 18:55:58 +0800</pubDate>
			<guid>http://www.wenlei.net/default.asp?id=123</guid>	
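		<description><![CDATA[To be an excellent SEM practitioner, a person should have the following traits:
<p>1. They must love getting to the bottom of things, with the drive to find out why a given page is not ranking and the ability to dig deep into the data to find the factors that affect performance.</p>
<p>2. They must be competitive and eager to win everything they can for themselves and their clients. They should not be lulled by good current rankings and traffic, but should always keep the desire to improve performance.</p>
<p>3. They must be passionate about the industry. If they see it as just another job, they cannot succeed. All well-known search experts are passionate about this industry: they keep pursuing higher rankings and more frequent indexing, and do everything they can to find the best way past obstacles and deliver the greatest value to clients.</p>
<p>4. They must always keep an apprentice's mindset; in other words, there is no end to learning in this industry. They should keep studying the search industry, and also psychology, marketing, change management, and so on, in order to stay on a good learning curve.</p>]]></description>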
		</item>
		
			<item>
			<link>http://www.wenlei.net/default.asp?id=119</link>
			<title><![CDATA[How to Explain Search Engine Optimization to Someone Who Does Not Know SEO]]></title>
			<author>wenlei@vip.qq.com(闻雷)</author>
			<category><![CDATA[网络营销战略]]></category>
			<pubDate>Tue, 29 May 2007 11:35:37 +0800</pubDate>
			<guid>http://www.wenlei.net/default.asp?id=119</guid>	
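		<description><![CDATA[<p>SEO means tidying up your site's structure, organizing its content more consistently, and so on, so that organic search brings your site a steady, long-term stream of visits.</p>
<p>If that is still unclear, think of it this way: you change your website so that search engines like it better, and a search engine that likes your site puts it nearer the top. More visits to your site mean more potential customers, and more of them turn into actual customers.</p>]]></description>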
		</item>
		
			<item>
			<link>http://www.wenlei.net/default.asp?id=85</link>
			<title><![CDATA[How a Corporate Website Can Truly Deliver Results]]></title>
			<author>wenlei@vip.qq.com(闻雷)</author>
			<category><![CDATA[网络营销战略]]></category>
			<pubDate>Thu, 08 Mar 2007 22:39:21 +0800</pubDate>
			<guid>http://www.wenlei.net/default.asp?id=85</guid>	
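		<description><![CDATA[Although most companies still conduct the bulk of their business through traditional channels, more and more of them want to compete through the internet; this is an inevitable trend of economic development in the information age. So how should a company, and a small or medium-sized company in particular, build its website so that the site truly delivers results?
<p>A website is an important part of a company's information infrastructure and a window onto its image and strength. Going online should not be for show; it should produce results. A website should not be built merely to look good or to keep up with others. It should follow a plan and model suited to the company's own operations, aiming for the greatest return from the smallest investment.</p>
<p>In recent years, building a website has become the first step in a company's move to digital operations. After several years of effort, a considerable number of companies have made a &ldquo;home&rdquo; on the web with a promotional site of their own. One phenomenon cannot be ignored, however. Among companies that have built sites, some still use them to publish information or update their products, while others, once the site is finished and the novelty wears off, never look at it again, and the site becomes like &ldquo;a deaf person's ears: purely for show&rdquo;. Sites that end up this way usually share the following problems: first, incomplete planning and design; second, an unsound approach to production; third, inadequate promotion; and fourth, poorly managed maintenance.</p>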
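<p>Be clear about the &ldquo;reason&rdquo; for building the site</p>
<p>Be clear about why the company is building a website. A company does not build a site to follow a passing trend or to win a good reputation. It does so to promote itself and open up markets through the global network of the internet, to lower its management, transaction, and after-sales service costs, and to earn more profit through a range of e-commerce activities. All of these are consistent with the company's business goals. Only by tying information technology closely to the company's management system, production processes, and commercial activities can the site be built and maintained properly, do its job, and serve the company.</p>
<p>Put together a good team</p>
<p>Decide who will build and manage the site. A large company can set up a dedicated department or appoint a CIO (chief information officer) with overall responsibility for planning its information technology development. A small or medium-sized company may find a dedicated CIO hard to justify and can appoint a part-time CIO instead.</p>
<p>The CIO is responsible not only for planning, building, managing, updating, and maintaining the company website, but also for drawing up the company's information technology development plan, spreading knowledge about doing business online, and organizing staff to modernize the company's traditional management and production models with information technology. Is it better for the company simply to build a promotional website, or to develop full e-commerce integrated with its internal operations? On questions like these, the CIO can work out the information technology plan that best fits the company's own development.</p>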
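<p>Develop your own character</p>
<p>The site's content and presentation should be distinctive. Cookie-cutter corporate website templates can serve as a reference, but they should not be copied without thought; adapt them to the company's own characteristics. There are many ways to do this. Here we discuss only some common issues for companies to consider when building a site.</p>
<p>First, the home page should be simple. There is no need for a large animation. Not every visitor can view animations properly, and they take a long time to download; visitors lose patience before they have seen any actual content, which defeats the purpose of building the site.</p>
<p>Second, the company profile should be comprehensive. Present the company from many angles, including its history, development, scale, strengths, distinctive features, standing, media coverage, awards, and record of integrity, and support the text with photographs. Note that material aimed at internal management, such as corporate philosophy, should not be described in too much detail.</p>
<p>Third, product and service information should be detailed. Put online everything customers or consumers need when using the product: its name, main ingredients, function, scope of use, photographs, trademark, price, after-sales service, and so on. Some companies are reluctant to list prices online for competitive or confidentiality reasons, which is not necessarily to their advantage. In the author's view, most products should come not only with a price list but also with online ordering. For example, adding a &ldquo;where to see or buy this product&rdquo; link next to the detailed product information can produce unexpectedly good results.</p>
<p>Fourth, provide contact details. Ideally, list contact information for the relevant departments and staff, such as sales, quality management, regional sales centers, distributors, and after-sales service. Everyday contact is sometimes where a business opportunity begins.</p>
<p>Fifth, offer interactive features. Online interaction lets visitors send you feedback and suggestions. The most important thing is to state the response time honestly, so that people who leave messages can plan when to return and do not lose trust in the site after checking repeatedly without getting a reply. If possible, publish earlier replies on the site for later visitors to consult.</p>
<p>Sixth, support downloading and printing. Materials the company can make public, such as product photographs, forms, and manuals, should ideally be downloadable and printable, so that visitors can study the company's products offline, which creates more business opportunities. Other sections can be designed around the company's own needs. Keep in mind that people who genuinely intend to work with you care about substance and will not judge the site on its looks. That does not mean visual design is unimportant; the right approach is design that fits its purpose.</p>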
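<p>&ldquo;Build&rdquo; the site's name</p>
<p>Pay attention to promoting the site. Once the site is built, promotion is very important, and the site's web address and email address are its basic elements. The usual approach is to register with search engines, including internet real-name (网络实名) services. That alone is not enough. The web address should also appear prominently on the company's overall packaging, business cards, letters, company brochures, product brochures, user manuals, and all advertising materials. Company leaders in particular should set an example and take the lead in promoting it. If the company sells to or serves customers worldwide, it should also provide versions in several languages, planned sensibly according to where its products are sold.</p>
<p>Manage operations and maintenance well</p>
<p>Strengthen the site's management and maintenance. After the site is built, management and maintenance are very important, including updating news, updating new products, replying to inquiries, and site security. Company leaders must take the site's management seriously. The CIO (including a part-time CIO) must take real responsibility: set up rules for managing and routinely updating the site, put assessment, reward, and penalty measures in place, and establish channels for supplying updated information, so that the site does its job.</p>
<p>In short, going online should not be for show; it should produce results. A website should not be built merely to look good or to keep up with others. It should follow a plan and model suited to the company's own operations, aiming for the greatest return from the smallest investment. Only when the site becomes an effective link between the company and its customers can it truly do its job. A company's website should also pay close attention to its own specific customer base, keep in touch with customers in a variety of ways, and encourage them to keep communicating with the company through the site, so as to deepen customer relationships, understand customer needs more fully, and support the company's growth.</p>]]></description>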
		</item>
		
</channel>
</rss>