Advanced Settings (Crawler)

To specify the language of content, what to do with rejected documents, and a crawler tag:

Under Content Language, in the drop-down list, choose the language in which the majority of content that you want to import is written.
Under Rejected Documents, specify what to do with documents that do not successfully sort into a folder:

To import these documents anyway, choose Import into the Unclassified Documents folder.

Note: The Unclassified Documents folder is available to users with access to unclassified documents. To access unclassified documents, in the Directory menu, click Edit Directory and open the Unclassified Documents folder. You can also click Administration | Select Utilities | Access Unclassified Documents.
To avoid importing these documents, choose Do not import.

If you are editing an existing crawler, you see the section Importing Documents. Under Importing Documents, specify whether to import only new documents. By default, this crawler attempts to import only new documents (those that have not been previously imported by this crawler or other crawlers that access this same content source). You can change the crawler setting to import multiple copies of each document, which might be useful while testing your crawlers.

To import only new documents, select Import only new links. Additional options are displayed. Otherwise, skip to Step 4.
Specify what "new links" means:

To import only those documents that have not been previously imported by this crawler, choose by this Crawler.
To import only those documents that have not been imported from this crawler's content source (either by this crawler, another crawler, or manually by a user), choose from this Content Source.

Note: The option you choose here affects your actions in Step 3f and Step 4.

To refresh the previously imported documents as specified on the Document Settings page, select refresh them. Generally, refreshing documents is the job of the Document Refresh Agent, and refreshing them during a crawl slows the crawler down. However, if you changed the document settings for this crawler or changed the property mappings in the associated content types, refreshing documents updates these settings for the previously imported documents.
If you created additional folders or applied different filters to destination folders, select try to sort them into additional folders to sort the previously imported documents to new Knowledge Directory folders.

Another crawler might have imported documents from the same content source but into different folders than the destination folders specified for this crawler. Make sure you really want to re-sort those documents into the destination folders specified for this crawler.
To re-import documents that were previously deleted (manually, due to expiration, or due to missing source documents), select regenerate deleted links. This might re-import documents that were at one time deemed inappropriate for your portal.
If absolutely necessary, you can delete the record of documents that have been deleted from the portal (the deletion history). The scope of this history depends on how you defined new documents in Step 3b:

If you chose "by this Crawler," the history includes all documents imported by this crawler that have been deleted.
If you chose "from this Content Source," the history includes all documents imported from this content source that have been deleted. Therefore, you are essentially deleting the history for all crawlers that import documents from this content source.

If you are still sure that you must delete the record of documents deleted from the portal, click Clear Deletion History.
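
The two scopes chosen in Step 3b can be easier to compare in code. The following is a minimal Python sketch, not the portal's implementation: the ImportHistory class, its fields, and the importer names are hypothetical and exist only to illustrate how "by this Crawler" and "from this Content Source" change what counts as new and what Clear Deletion History removes.

```python
from collections import defaultdict

# Scope labels taken from the two options described in Step 3b.
BY_CRAWLER = "by this Crawler"
FROM_SOURCE = "from this Content Source"


class ImportHistory:
    """Hypothetical in-memory model of import and deletion records."""

    def __init__(self):
        # content source -> importer (crawler name or "manual") -> set of links
        self.imported = defaultdict(lambda: defaultdict(set))
        self.deleted = defaultdict(lambda: defaultdict(set))

    def is_new(self, source, crawler, link, scope):
        """Return True if the link counts as new under the chosen scope."""
        if scope == BY_CRAWLER:
            # Only this crawler's own imports are considered.
            return link not in self.imported[source][crawler]
        if scope == FROM_SOURCE:
            # Imports by any crawler, or manually by a user, are considered.
            return all(link not in links
                       for links in self.imported[source].values())
        raise ValueError(f"unknown scope: {scope!r}")

    def clear_deletion_history(self, source, crawler, scope):
        """Forget deleted documents within the chosen scope."""
        if scope == BY_CRAWLER:
            self.deleted[source][crawler].clear()
        elif scope == FROM_SOURCE:
            # Affects every crawler that imports from this content source.
            self.deleted[source].clear()
        else:
            raise ValueError(f"unknown scope: {scope!r}")
```

In this model, a document imported by one crawler is still new to a second crawler under "by this Crawler" but not under "from this Content Source", and clearing the deletion history under the wider scope removes the records of every crawler that uses the content source.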

If you are editing an existing crawler, you see additional options under Rejected Documents. Under Rejected Documents, specify what to do when this crawler finds a previously rejected document. Again, the definition of "previously rejected" depends on the option you chose in Step 3b:

If you chose "by this Crawler," previously rejected documents include all documents rejected by this crawler.
If you chose "from this Content Source," previously rejected documents include all documents rejected from this content source.

To have this crawler try to import previously rejected documents, select Re-import.
To delete the rejection history, click Clear Rejection History. Remember, if you chose "from this Content Source" in Step 3b, you are essentially deleting the rejection history for all crawlers that import documents from this content source.

Note: If a document does not sort into any folder but is placed into the Unclassified Documents folder, this does not count as being rejected. Rejected documents are documents that were not placed in any folder.
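
The distinction in the note above can be sketched as a small decision function. This is an illustrative Python sketch only: the function name and the outcome labels are hypothetical, and the option strings are copied from the Rejected Documents choices described earlier in this topic.

```python
# Option labels as they appear under Rejected Documents.
IMPORT_UNCLASSIFIED = "Import into the Unclassified Documents folder"
DO_NOT_IMPORT = "Do not import"


def import_outcome(matching_folders, rejected_option):
    """Classify what happens to one document (hypothetical outcome labels)."""
    if matching_folders:
        # The document sorted into at least one destination folder.
        return "sorted"
    if rejected_option == IMPORT_UNCLASSIFIED:
        # Placed in Unclassified Documents: this does not count as rejected.
        return "unclassified"
    if rejected_option == DO_NOT_IMPORT:
        # Not placed in any folder: this is a rejected document.
        return "rejected"
    raise ValueError(f"unknown option: {rejected_option!r}")
```

Only the last outcome adds the document to the rejection history that Re-import and Clear Rejection History act on.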

To mark imported documents with a crawler tag, type a tag in the Mark imported documents with the following Crawler Tag box. This tag is used to differentiate documents imported by this crawler from those imported by another crawler.
Under Runtime Configuration, set the following:

Maximum document-fetching threads - determines the maximum number of concurrent threads used to fetch content from the content source.
Maximum card-indexing threads - determines the maximum number of concurrent threads used in processing content once it has been crawled into the portal.

The allowable ranges for these fields are set in the portalconfig.xml file. The values set here are also limited by the maximum number of threads allowed by the automation service used for this crawler job.
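
As a rough illustration of how the two limits combine, the following Python sketch computes the number of threads a crawler job could actually use. It rests on assumptions: the function name and the example numbers are hypothetical, and this topic does not say whether the portal rejects or adjusts an out-of-range value, so the sketch simply rejects it.

```python
def effective_threads(requested, allowed_range, automation_service_max):
    """Return the thread count after applying both limits (illustrative only)."""
    low, high = allowed_range  # range assumed to come from portalconfig.xml
    if not low <= requested <= high:
        # Assumption: values outside the configured range are not accepted.
        raise ValueError(f"{requested} is outside the allowed range {low}-{high}")
    # The automation service limit caps the value even when it is in range.
    return min(requested, automation_service_max)
```

For example, with a hypothetical allowed range of 1 to 20 and an automation service limit of 6, a setting of 10 would be accepted but only 6 threads would be available to the job.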

To display the page associated with this help topic:

Click Administration.
Open the Crawler Editor:

To create a new crawler:

Open an administrative folder.
In the Create Object drop-down list, click the type of crawler you want to create.

To edit an existing crawler:

Navigate to the crawler you want to edit.
Click the crawler name.

On the left, under Edit Object Settings, click Advanced Settings.