Create a crawler to import content into your portal from external content repositories. To search the external repository periodically and import its content, you must run a job associated with the crawler. For information about jobs, see About Jobs.
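To picture the crawler/job relationship, here is a minimal sketch in Python. Everything in it is hypothetical (run_crawl, the interval); in the portal you configure the schedule on the job itself rather than writing code:

    import time

    CRAWL_INTERVAL_SECONDS = 60 * 60  # hypothetical: re-crawl hourly

    def run_crawl():
        # Stand-in for one crawl pass: search the external repository and
        # import any content that passes the crawler's filters.
        print("Searching the external repository for content...")

    def run_job():
        # The job fires the crawler on a schedule; each pass re-imports
        # new or changed content.
        while True:
            run_crawl()
            time.sleep(CRAWL_INTERVAL_SECONDS)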
Note: Crawlers depend on content sources. For information on content sources, see About Content Sources.
To learn how to create or edit administrative objects (including crawlers), click here.
A Web crawler allows users to import content from the Web into the portal.
To learn more, see the Web Crawler Editor pages.
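As a rough illustration of what a Web crawler does (this is a self-contained sketch, not the portal's implementation; start_url and max_pages are invented parameters), the crawler fetches a page, extracts its links, and follows them breadth-first:

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        # Gathers the href of every anchor tag on a fetched page.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        # Breadth-first fetch starting at start_url; returns {url: html}.
        seen, queue, pages = set(), [start_url], {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                with urllib.request.urlopen(url) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip pages that cannot be fetched
            pages[url] = html  # the "imported" content a portal would index
            collector = LinkCollector()
            collector.feed(html)
            for link in collector.links:
                target = urljoin(url, link)
                if target.startswith("http"):
                    queue.append(target)
        return pages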
A remote crawler allows users to import content from an external content repository into the portal.
Some crawl providers are installed with the portal and are readily available to portal users; others you must install and set up manually. For example, Plumtree provides a number of crawl providers.
Note: For information on obtaining crawl providers, contact Customer Support. For information on installing crawl providers, refer to the Installation Guide for Plumtree Corporate Portal or the documentation that comes with your crawl provider, or contact your portal administrator.
You create a remote crawler in the Remote Crawler Editor. To learn more, see the Remote Crawler Editor pages.
The following crawl providers, if installed, add at least one extra page to the Remote Crawler Editor:
Windows NT File (included with the portal software)
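As a rough illustration of the role a crawl provider plays (this is a hypothetical interface, not the Plumtree crawl provider API), a provider is essentially an adapter the portal calls to list and fetch documents from its repository:

    import os
    from abc import ABC, abstractmethod
    from typing import Iterable

    class CrawlProvider(ABC):
        # Hypothetical interface; it only illustrates what a provider does.
        @abstractmethod
        def list_documents(self, folder: str) -> Iterable[str]:
            ...

        @abstractmethod
        def fetch_document(self, doc_id: str) -> bytes:
            ...

    class FileSystemProvider(CrawlProvider):
        # Loosely analogous to the Windows NT File provider named above.
        def list_documents(self, folder):
            return (os.path.join(folder, name) for name in os.listdir(folder))

        def fetch_document(self, doc_id):
            with open(doc_id, "rb") as f:
                return f.read()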
Content Crawler Web Services allow you to specify general settings for your remote content repository, leaving the target and security settings to the associated remote content source and remote crawler. This allows you to crawl multiple locations in the same content repository without repeatedly specifying all the settings.
Note: You create Content Crawler Web Services on which to base your remote content sources. For information on content sources, see About Content Sources.
To learn more, see the Content Crawler Web Service Editor pages.
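The split that a Content Crawler Web Service creates between general and per-target settings can be pictured as plain data. All names below are invented; the point is that one web service is reused by two content sources that point at different locations:

    # One Content Crawler Web Service carries the general,
    # repository-wide settings...
    web_service = {
        "name": "Example Repository Web Service",
        "endpoint": "http://crawl-server.example.com/provider",
    }

    # ...while each remote content source and crawler adds only the target
    # and security settings, so both locations reuse the same web service.
    content_sources = [
        {"web_service": web_service, "target": "/projects/alpha", "account": "svc-alpha"},
        {"web_service": web_service, "target": "/projects/beta", "account": "svc-beta"},
    ]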
Users can automatically be granted access to the content imported by some remote crawlers. The Global ACL Sync Map shows these crawlers how to import source document security.
For an example of how importing security works, click Importing Security Example.
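As a rough sketch of the idea (the portal's actual map format is not shown here), the map translates security principals found on source documents into portal users and groups:

    # Hypothetical mapping from source principals to portal users and groups.
    acl_sync_map = {
        "ACME\\Engineering": "Engineering (portal group)",
        "ACME\\jsmith": "jsmith",
    }

    def import_security(source_acl):
        # Keep only the entries the map knows how to translate; unmapped
        # principals are dropped rather than guessed at.
        return [acl_sync_map[e] for e in source_acl if e in acl_sync_map]

    print(import_security(["ACME\\Engineering", "ACME\\Unknown"]))
    # ['Engineering (portal group)']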
If your crawler does not import the expected content, check the following:
Make sure your folder filters are correctly filtering content (a minimal filter sketch follows this list). To learn about testing your filters, see the Testing Filters section on the Main Settings (Filter) page.
Make sure your crawler did not place unwanted content into the target folder. If a document does not filter into any subfolders, your crawler might place the document in the target folder. This is determined by a setting on the Main Settings page of the Folder Editor.
Make sure the crawler did not place content into the Unclassified Documents folder. If a document cannot be placed in any target folders or subfolders, your crawler might place it in the Unclassified Documents folder. This is determined by a setting on the Advanced Settings page of the Crawler Editor. If you have the correct permissions, you can view the Unclassified Documents folder either while editing the Knowledge Directory or by clicking Administration | Select Utility | Access Unclassified Documents.
Make sure you have at least Edit access to the target folder.
For Web crawlers, make sure that robot exclusion protocols, or any exclusions or inclusions you defined, are not keeping your crawler from importing the expected content (see the robots.txt sketch after this list). This is determined by settings on the Web Page Exclusions page of the Crawler Editor.
Make sure the authentication information specified in the associated content source allows the portal to access content.
Review the job history for additional information.
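For the folder filter check above, the pass/fail idea can be sketched in a few lines of Python. The patterns are invented, and the portal's filter options are richer, but testing a document name against includes and excludes is the core of it:

    import fnmatch

    include_patterns = ["*.doc", "*.pdf"]   # hypothetical filter settings
    exclude_patterns = ["*draft*"]

    def passes_filter(filename):
        # A document passes if it matches an include pattern and no
        # exclude pattern.
        included = any(fnmatch.fnmatch(filename, p) for p in include_patterns)
        excluded = any(fnmatch.fnmatch(filename, p) for p in exclude_patterns)
        return included and not excluded

    print(passes_filter("spec.pdf"))        # True
    print(passes_filter("spec-draft.pdf"))  # False

For the robot exclusion check, Python's standard library can show whether a site's robots.txt would block a given URL; the site and page below are placeholders:

    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.set_url("http://www.example.com/robots.txt")  # placeholder site
    parser.read()
    if not parser.can_fetch("*", "http://www.example.com/private/page.html"):
        print("robots.txt blocks this URL; a polite crawler skips it")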