tiddlers: Nutch入门


title	meta	text	revision
Nutch入门	{"created": "20230601085513660", "creator": "root", "tags": [], "title": "Nutch\u5165\u95e8", "modified": "20230610090702583", "modifier": "root", "type": "text/vnd.tiddlywiki", "revision": "5"}	!! Download Nutch

Download the latest stable release from the official Nutch website and extract it on your system. The following commands download and unpack Nutch:

```
wget http://mirror.olnevhost.net/pub/apache/nutch/1.19/apache-nutch-1.19-bin.tar.gz
tar xvzf apache-nutch-1.19-bin.tar.gz
```

!! Configure Nutch

In the Nutch directory, open conf/nutch-site.xml in a text editor and update the following properties.

```
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>YourWebCrawler</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>YourWebCrawler,*</value>
  </property>
  <property>
    <name>http.agent.host</name>
    <value>YourWebCrawler.com</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>1000000</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
  <property>
    <name>solr.server.url</name>
    <value>http://localhost:8983/solr/mycore</value>
  </property>
  <property>
    <name>solr.core.name</name>
    <value>mycore</value>
  </property>
</configuration>
```

With these settings, Nutch obeys robots.txt while crawling, limits fetched HTTP content to 1,000,000 bytes, and ignores external links.

!! Configure schema.xml

In conf/schema.xml you can configure the fields to be indexed in Solr. Customize it as needed.

!! [[Install Solr|安装 Solr]]

{{安装 Solr}}

!! Start crawling

```
bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/nutch
./apache-nutch-1.19/bin/crawl -s urls/ -i http://localhost:8983/solr/news_core ./crawl/
```

Here, ``urls`` is the location of the seed URL list, ``crawl`` is the name of the Nutch working directory, and ``http://localhost:8983/solr/nutch`` is the Solr URL. Note that ``bin/nutch crawl`` is the legacy one-step command; recent 1.x releases use the ``bin/crawl`` script instead. Finally, you can inspect the indexed data in Solr.

The following invocations work; the trailing ``1`` is the number of crawl rounds:

```
$ nutch/bin/crawl -s urls/ crawl/ 1
$ nutch/bin/crawl -s urls/ -i --size-fetchlist 10 crawl/ 1
```	8
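
The crawl commands above read seeds from ``urls/``, but the steps never show how that directory is created. A minimal sketch, assuming a single plain-text seed file with one URL per line; the file name ``seed.txt`` and the example URL are placeholders and do not come from the original notes:

```shell
# Assumption: seeds live in urls/seed.txt (hypothetical file name), one URL per line.
mkdir -p urls

# Placeholder seed URL; replace it with the site you actually want to crawl.
printf '%s\n' 'https://nutch.apache.org/' > urls/seed.txt

# Sanity check before crawling: count the non-empty lines in the seed file.
seed_count=$(grep -c . urls/seed.txt)
echo "seeds: $seed_count"
```

Run this from the directory where the crawl command is started, so that the relative ``urls/`` path passed to ``-s`` resolves to the directory created here.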