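When we configure Nutch to crawl http://yangshangchuan.iteye.com, the content of every fetched page is: "Your access request has been denied ..." (您的访问请求被拒绝 ......). This is the simplest anti-crawler strategy: the server just reads the value of the User-Agent HTTP request header to decide whether the client is a person (a browser) or a crawler. To get around the restriction, we only need to configure Nutch to simulate a web browser.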
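To show how little such a check involves, here is a minimal sketch of a server-side filter of this kind. The rule it applies (accept only User-Agent values that start with "Mozilla/") and the names UserAgentFilter and looksLikeBrowser are assumptions made for this example; the actual rule used by iteye.com is not known.

```java
// Hypothetical sketch of a naive User-Agent based anti-crawler check.
// The "Mozilla/" prefix rule is an assumption, not the real rule of any site.
class UserAgentFilter {

    // Returns true when the header value looks like it came from a browser.
    static boolean looksLikeBrowser(String userAgent) {
        return userAgent != null && userAgent.trim().startsWith("Mozilla/");
    }

    public static void main(String[] args) {
        // A crawler-style value is rejected by this rule ...
        System.out.println(looksLikeBrowser("MyCrawler/Nutch-1.7"));
        // ... while a browser-style value is accepted.
        System.out.println(looksLikeBrowser(
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```

A check like this looks only at one header the client controls, which is why changing that header is enough to pass it.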
nutch-default.xml contains five properties related to the User-Agent:
```xml
<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-1.7</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>
```
The class nutch1.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java shows how these five properties are assembled into the User-Agent:
```java
this.userAgent = getAgentString(
    conf.get("http.agent.name"),
    conf.get("http.agent.version"),
    conf.get("http.agent.description"),
    conf.get("http.agent.url"),
    conf.get("http.agent.email"));
```
```java
private static String getAgentString(String agentName,
                                     String agentVersion,
                                     String agentDesc,
                                     String agentURL,
                                     String agentEmail) {

  if ( (agentName == null) || (agentName.trim().length() == 0) ) {
    // TODO : NUTCH-258
    if (LOGGER.isErrorEnabled()) {
      LOGGER.error("No User-Agent string set (http.agent.name)!");
    }
  }

  StringBuffer buf= new StringBuffer();

  buf.append(agentName);
  if (agentVersion != null) {
    buf.append("/");
    buf.append(agentVersion);
  }
  if ( ((agentDesc != null) && (agentDesc.length() != 0))
    || ((agentEmail != null) && (agentEmail.length() != 0))
    || ((agentURL != null) && (agentURL.length() != 0)) ) {
    buf.append(" (");

    if ((agentDesc != null) && (agentDesc.length() != 0)) {
      buf.append(agentDesc);
      if ( (agentURL != null) || (agentEmail != null) )
        buf.append("; ");
    }

    if ((agentURL != null) && (agentURL.length() != 0)) {
      buf.append(agentURL);
      if (agentEmail != null)
        buf.append("; ");
    }

    if ((agentEmail != null) && (agentEmail.length() != 0))
      buf.append(agentEmail);

    buf.append(")");
  }
  return buf.toString();
}
```
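To make the resulting format concrete, the sketch below re-implements the same concatenation rules in a standalone class and builds the header value for a made-up crawler. The class name AgentStringDemo, the helper isSet, and the sample values (MyCrawler, "test crawler", http://example.com/bot, "bot at example dot com") are hypothetical; only the joining logic follows the method above.

```java
// Standalone sketch mirroring the concatenation rules of getAgentString.
// This is not the Nutch class itself; all sample values are hypothetical.
class AgentStringDemo {

    // True when the value is neither null nor empty.
    static boolean isSet(String s) {
        return s != null && s.length() != 0;
    }

    static String build(String name, String version,
                        String desc, String url, String email) {
        StringBuilder buf = new StringBuilder();
        buf.append(name);
        if (version != null) {
            buf.append("/").append(version);   // name and version are joined by "/"
        }
        if (isSet(desc) || isSet(url) || isSet(email)) {
            buf.append(" (");
            if (isSet(desc)) {
                buf.append(desc);
                if (url != null || email != null) buf.append("; ");
            }
            if (isSet(url)) {
                buf.append(url);
                if (email != null) buf.append("; ");
            }
            if (isSet(email)) buf.append(email);
            buf.append(")");
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // All five values set: name/version (description; url; email)
        System.out.println(build("MyCrawler", "Nutch-1.7", "test crawler",
                "http://example.com/bot", "bot at example dot com"));
        // Only name and version set: no parenthesised suffix is added.
        System.out.println(build("MyCrawler", "Nutch-1.7", "", "", ""));
    }
}
```

When the description, URL and email are all empty, the value is just the name and the version joined by "/".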
The class nutch1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java then sends the User-Agent request header. The userAgent returned by http.getUserAgent() here is the userAgent field set in HttpBase.java:
```java
String userAgent = http.getUserAgent();
if ((userAgent == null) || (userAgent.length() == 0)) {
  if (Http.LOG.isErrorEnabled()) { Http.LOG.error("User-agent is not set!"); }
} else {
  reqStr.append("User-Agent: ");
  reqStr.append(userAgent);
  reqStr.append("\r\n");
}
```
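One way to confirm which User-Agent value a server actually receives is to send a request to a small local server that echoes the header back. The sketch below does this with JDK classes only (com.sun.net.httpserver.HttpServer and HttpURLConnection) and needs Java 9 or later; it does not use Nutch, and the class name UserAgentEcho, the method roundTrip and the value MyCrawler/Nutch-1.7 are made up for this example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Sketch: a local server echoes back the User-Agent header it received.
class UserAgentEcho {

    // Starts an echo server on a free local port, sends one request with the
    // given User-Agent, and returns the value the server saw.
    static String roundTrip(String userAgent) {
        HttpServer server = null;
        try {
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String seen = exchange.getRequestHeaders().getFirst("User-Agent");
                byte[] body = String.valueOf(seen).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();

            int port = server.getAddress().getPort();
            HttpURLConnection conn = (HttpURLConnection)
                    URI.create("http://127.0.0.1:" + port + "/").toURL()
                       .openConnection(Proxy.NO_PROXY);
            // Overrides the default "Java/<version>" User-Agent of the JDK client.
            conn.setRequestProperty("User-Agent", userAgent);
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("MyCrawler/Nutch-1.7"));
    }
}
```

The same kind of echo endpoint could be used as a crawl target to inspect what a configured crawler sends, though that setup is not shown here.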
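From the analysis above, adding any one of the following configurations to nutch-site.xml is enough to imitate a specific browser. Because getAgentString joins http.agent.name and http.agent.version with "/", each browser's User-Agent string is split at a "/" between the two properties, and the description, URL and email properties are left empty so that nothing is appended in parentheses: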
1. Imitating Firefox:
```xml
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>20100101 Firefox/27.0</value>
</property>
```
2. Imitating IE:
```xml
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>6.0)</value>
</property>
```
3. Imitating Chrome:
```xml
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>537.36</value>
</property>
```
4. Imitating Safari:
```xml
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>534.57.2</value>
</property>
```
5. Imitating Opera:
```xml
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>19.0.1326.59</value>
</property>
```
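All five configurations follow the same pattern: the browser's full User-Agent string is cut at a "/" and the two halves go into http.agent.name and http.agent.version. The sketch below shows that step, assuming the cut is made at the last "/" (the Firefox example above cuts at an earlier one, which works equally well). The class name AgentSplitter and the method split are made up for this example.

```java
// Sketch: split a browser User-Agent string into the two Nutch properties.
class AgentSplitter {

    // Returns {name, version}, split at the last "/" so that
    // name + "/" + version reproduces the original string.
    static String[] split(String browserUserAgent) {
        int slash = browserUserAgent.lastIndexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("no '/' in: " + browserUserAgent);
        }
        return new String[] {
            browserUserAgent.substring(0, slash),
            browserUserAgent.substring(slash + 1)
        };
    }

    public static void main(String[] args) {
        String chrome = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36";
        String[] parts = split(chrome);
        System.out.println("http.agent.name    = " + parts[0]);
        System.out.println("http.agent.version = " + parts[1]);
    }
}
```

For the Chrome string used in main, the two halves match the values in the Chrome configuration above.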
Postscript: ways to view the User-Agent: [three numbered items, originally screenshots; their content is not recoverable]