About External Site Catalog
Index and search external sites in a Plone site
External Site Catalog has been designed by the INGENIWEB team.
External Site Catalog 1.2.0 is licensed under the GNU GPL license.
Screenshots of the administration interface.
Management screen of an ExternalSite task
Statistics after having crawled a site
ExternalSiteCatalog is a web crawler that can index external sites and make them searchable in Plone. You can specify the sites to index in a Plone configlet, and either index them directly from Plone or let a scheduler do the job. Have a look at the screenshots in the doc folder of the product to get a first impression of what it looks like. Searching the external sites is done in a special portlet that is installed with ExternalSiteCatalog. External sites are not searchable in the normal Plone catalog; they are only available in a separate catalog in the portal_externalcatalog tool.
Direct indexing
All external sites are configured in the ExternalSiteCatalog configlet. If you want to index external sites immediately, you can do so after defining all the parameters, selecting your site, and clicking on the "index" button in the configlet.
Link Exclusion and Inclusion
Links can be included or excluded using lists of regular expressions.
Some hints if you are not familiar with regular expressions:
Read the Regular Expression HOWTO by A.M. Kuchling
Check out Kodos, the Python Regular Expression Debugger:
Crawling only certain links
If you only need to index certain pages of the external site, use the "Path inclusion regular expressions" list property in the task definition.
Using the following regular expression, only the documents starting at the docs folder will be indexed.
You can specify as many regular expressions in the list as you like. A link is followed only if at least one of the regular expressions matches it.
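As a rough illustration of the rule above, the following Python sketch follows a link only when at least one inclusion pattern matches. The pattern, the sample URLs and the helper name is_included are assumptions made for this example, as is the use of re.search against the full URL; the product's own matching code may differ.

```python
import re

# Hypothetical inclusion list: only follow links below a "docs" folder.
INCLUSION_PATTERNS = [r"/docs(/|$)"]


def is_included(url, patterns=INCLUSION_PATTERNS):
    """Return True if at least one inclusion pattern matches the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


print(is_included("http://example.org/docs/install.html"))
print(is_included("http://example.org/news/2006.html"))
```

Testing your own patterns this way, or in a tool such as Kodos, before pasting them into the task definition can save a crawl that indexes nothing.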
Have a look at the statistics of indexing in the overview page of the ExternalSiteCatalog configlet. Included links will appear in a special section.
Excluding certain links
If you need to exclude certain links from being indexed, use the "Path exclusion regular expressions" list property in the task definition.
Using the following regular expression, links ending in "folder_contents" will be ignored.
You can specify as many regular expressions in the list as you like. If one of the regular expressions matches your link, it will be ignored.
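The exclusion rule can be sketched in Python in the same spirit. The pattern below is only a guess at what an expression for links ending in "folder_contents" could look like, and is_excluded is a hypothetical helper; how the product itself applies the expressions (for example re.search versus re.match, path versus full URL) is not confirmed here.

```python
import re

# Hypothetical exclusion list: ignore links ending in "folder_contents".
EXCLUSION_PATTERNS = [r"folder_contents$"]


def is_excluded(url, patterns=EXCLUSION_PATTERNS):
    """Return True if any exclusion pattern matches the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


print(is_excluded("http://example.org/docs/folder_contents"))
print(is_excluded("http://example.org/docs/index.html"))
```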
Have a look at the statistics of indexing in the overview page of the ExternalSiteCatalog configlet. Excluded links will appear in a special section.
Crawling external sites regularly
If you want external sites to be crawled regularly every day or month, you'll have to do some extra work. Make sure to install PloneMaintenance starting from version 1.3. Make sure that you are also regularly calling PloneMaintenance from cron or one of the Zope schedulers.
Follow the installation instructions in PloneMaintenance!
Have a look at the portal_maintenance tool in the Zope Management interface! It contains a lot of useful information!
Console indexing utility
The console indexing utility is a bit more complicated than doing everything from Plone. In many cases you don't have the resources to administer an external utility, but if you can afford it, this tool lets you decouple the long-running work of fetching external sites from Zope, so that no Zope thread is blocked for a long time crawling external sites. The external tool only calls Zope when a page has to be indexed, so there is still some load on the Zope server!
Please note that this external tool is not making use of the information entered in the Plone configlet. It is completely independent! It also does not make use of PloneMaintenance, and it is up to you to configure and call it with a scheduler like cron.
To avoid running long Zope transactions while browsing an external site, the indexing is driven from the console utility '.../ExternalSiteCatalog/bin/indexexternalsite.py'. Just cd there and type this for hints:

  $ python indexexternalsite.py -h
Querying an ExternalSiteCatalog object
Querying an ExternalSiteCatalog works just like querying a regular ZCatalog. Its indexes are:
- 'PrincipiaSearchSource' (ZCTextIndex, HTML friendly)
- 'hostname' (FieldIndex)
Its metadata are:
- 'url', the URL to the page
- 'title', HTML title of the page when found
When queried, an ExternalSiteCatalog acts just like a ZCatalog that has the traditional 'PrincipiaSearchSource' ZCTextIndex and the 'url' and 'title' metadata. Note that 'title' may be empty, since external pages may not have a title.
This is the simplest template for querying an ExternalSiteCatalog and displaying results:

  <html>
    <head>
      <title>Searching other sites</title>
    </head>
    <body tal:define="catalog nocall: container/yourExternalSiteCatalog">
      <h2>Searching</h2>
      <form action="#" tal:attributes="action template/absolute_url">
        Search: <input type="text" name="PrincipiaSearchSource" />
        <br />
        In:
        <select name="hostname">
          <option value="">--Any--</option>
          <option tal:repeat="item python:catalog.uniqueValuesFor('hostname')"
                  tal:content="item">
            option
          </option>
        </select>
        <br />
        <input type="submit" value="Search" />
      </form>
      <hr />
      <h2>Results</h2>
      <tal:block define="pss request/form/PrincipiaSearchSource;
                         hostname request/form/hostname | nothing;
                         results python: catalog(PrincipiaSearchSource=pss, hostname=hostname)">
        <div tal:repeat="result results">
          <a href="#"
             tal:content="result/title | result/url"
             tal:attributes="href result/url">
            Some result
          </a>
        </div>
      </tal:block>
    </body>
  </html>
Of course, you should replace 'yourExternalSiteCatalog' with the name of your own ExternalSiteCatalog. This sample is also easy to translate to DTML.
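The same query can also be issued from Python code. The sketch below assumes only what is stated above: the catalog is callable like a ZCatalog, accepts the 'PrincipiaSearchSource' and 'hostname' indexes, and returns results carrying 'url' and 'title' metadata. The function name search_external and the FakeBrain and fake_catalog stand-ins are hypothetical; they exist only so that the example runs outside Zope.

```python
def search_external(catalog, text, hostname=None):
    """Query the catalog and return a list of (label, url) pairs."""
    query = {"PrincipiaSearchSource": text}
    if hostname:
        # Restrict the search to one site only when a hostname is given.
        query["hostname"] = hostname
    # 'title' may be empty, so fall back to the URL as the link label.
    return [(brain.title or brain.url, brain.url) for brain in catalog(**query)]


# Stand-ins for demonstration only; not part of ExternalSiteCatalog.
class FakeBrain:
    def __init__(self, url, title=""):
        self.url = url
        self.title = title


def fake_catalog(**query):
    """Ignore the query and return two canned results."""
    return [
        FakeBrain("http://example.org/docs/", "Docs"),
        FakeBrain("http://example.org/untitled.html"),
    ]


for label, url in search_external(fake_catalog, "install"):
    print(label, url)
```

In a real site you would pass your own ExternalSiteCatalog object instead of fake_catalog.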
- The "Ingeniweb":http://www.ingeniweb.com team.
- Using a customized version of the fantastic "HarvestMan":http://harvestman.freezope.org (thanks to Anand B Pillai).
- Amine Mohamed Soulaymani ("email@example.com":mailto:firstname.lastname@example.org).
- Maik Röder - Direct indexing in Plone, configlet, unit tests, functional tests, cleanup, request debug logging
Released versions of External Site Catalog are available here. The current version is 1.2.0.
The repository contains the up-to-date version of our source code. To get the HEAD branch of ExternalSiteCatalog, use:
cvs -z3 -d:pserver:email@example.com:/cvsroot/ingeniweb co ExternalSiteCatalog
You can also browse the CVS with your browser.
Please take the time to read the Readme.
External Site Catalog changes
- Catalog should be created like the Plone Catalog. Look especially at Lexicon and Splitter configuration
- It should be possible to define a maximum depth
- It should be possible to do incremental indexing. Everything is prepared for this. Just need to pass stats stored in the task to the indexer.
- Fix encoding issues
1.2 - 2006/12/12
Included and excluded URLs can now be configured in the configlet using regular expressions. This is extremely useful if you want to include only pages starting under a certain folder. The regular expression for inclusion of the "docs" folder looks like this:
Or, if you want to exclude a URL from the catalog, such as the link to folder_contents:
Renamed the skin layer "externalsitecatalogskins" to "externalsitecatalog"
config.py now contains an option "debug_requests" for a debug mode that can be useful if you want to track the requests emitted by ExternalSiteCatalog.
1.1.2 - 2006/08/01
- Fix import problem. Installation scripts were wrongly imported from PloneSubscription.
1.1.1 - 2006/07/27
- Remove _post_init call on ExternalSiteCatalogTask because it has been removed from PloneMaintenanceTask class since version 1.4
1.1 - 2006/07/12
- Add a new lexicon that works better for Latin languages.
- Fix a template bug on external_catalog_search.zpt
- Change the charset detection strategy when indexing pages: the charset is read from the "Content-Type" meta tag if it is not found in the HTTP headers.
1.0 - 2006/06/02
- Initial release
The External Site Catalog ChangeLog is also available for detailed information.