Preventing data from being indexed

Robots exclusion is a way to prevent data from being indexed. Domain-wide search does not index artifacts whose URLs contain a pattern specified in the robots.txt file.

The robots.txt file resides in the www area of the version control repository. External spiders and indexers use the robots.txt file to decide which web pages not to index. Each domain has a default robots.txt for this purpose.

Creating and Working with a robots.txt file

The robots.txt file is placed in the Version Control area directly under the www folder in a project. You can access the robots.txt file by using the URL pattern http[s]://projectname.domainname/robots.txt.

Format of the robots.txt file

The robots.txt file consists of one or more records. Each line of a record is in the form <field>:<optionalspace><value><optionalspace>. The field name is either User-agent or Disallow; both are explained below. A record starts with a User-agent field, followed by at least one Disallow field. The "#" character starts a comment: anything after "#" on a line is ignored by the robots.txt parser.
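
As a rough illustration of the format above, the following Python sketch splits a robots.txt file into records. It is not the parser used by the product; the function name parse_records is hypothetical, and the sketch assumes that every User-agent line opens a new record.

```python
def parse_records(text):
    """Split robots.txt text into (user_agent, [disallow values]) records."""
    records = []
    for raw_line in text.splitlines():
        # Anything after "#" is a comment and is ignored.
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            # Assumption: each User-agent line starts a new record.
            records.append((value, []))
        elif field == "disallow" and records:
            # Disallow lines belong to the most recent User-agent line.
            records[-1][1].append(value)
    return records


sample = """\
User-agent: *  # to be respected by all robots
Disallow: /source/
User-agent: CEE
Disallow:
"""
records = parse_records(sample)
```

Here records holds two entries: one for "*" with the value /source/, and one for CEE with a single empty Disallow value.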

User-agent:

The value of this field is the name of the robot, crawler, or search engine for which the record describes the access policy. The user-agent name is case-insensitive. A robots.txt file can have one or more User-agent fields. If the value is "*", the record describes the default access policy for any robot whose name is not specified in any other record. Only one record can have the value "*". The internal user agent is "CEE".

Disallow:

The value of this field specifies a URL pattern that is not to be indexed by the search engine. The pattern should be taken from the URL used to access the indexed artifact. Only the resource portion of the URL should be specified in the robots.txt file. For example, take the URL https://domainname/issues/show_bug.cgi?id=1111. The resource portion of the URL is /issues/show_bug.cgi?id=1111.
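
The resource portion can also be extracted programmatically. The following is a minimal Python sketch using the standard urllib.parse module; the helper name resource_portion is hypothetical.

```python
from urllib.parse import urlsplit


def resource_portion(url):
    """Return the path plus query string of a URL, without scheme or host."""
    parts = urlsplit(url)
    resource = parts.path
    if parts.query:
        resource += "?" + parts.query
    return resource


resource = resource_portion("https://domainname/issues/show_bug.cgi?id=1111")
```

For the URL above, resource is /issues/show_bug.cgi?id=1111. A URL that consists only of a scheme and host name yields an empty string.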

The Disallow value can be either the full resource URL or a partial one. An empty value for the Disallow field specifies that nothing is to be excluded, and a value of / (without quotes) excludes everything.

For example, the entry Disallow: /issues/ excludes all URLs that contain /issues/.
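
Taking the description above literally, a Disallow value excludes any resource that contains it. The Python sketch below illustrates that reading. The name is_excluded is a hypothetical helper, not the product's matching code, and the actual indexer may differ in details.

```python
def is_excluded(resource, disallow_values):
    """Return True if any non-empty Disallow value occurs in the resource."""
    for pattern in disallow_values:
        # An empty Disallow value excludes nothing, so it is skipped.
        if pattern and pattern in resource:
            return True
    return False


rules = ["/issues/", "/specs"]
hit = is_excluded("/issues/index.html", rules)   # contains /issues/
miss = is_excluded("/issues.html", rules)        # no trailing slash after "issues"
```

Under this reading, a value of / matches every resource that starts with a slash, which is consistent with it meaning "exclude all".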

Sample robots.txt files

The following are some valid robots.txt files.

Sample 1:

User-agent: * # to be respected by all robots
Disallow: /issues/ # excludes all URLs that contain /issues/, e.g. /issues/index.html, but does *NOT* exclude /issues.html
Disallow: /specs # excludes /specs.html and /specs/robots.html.
Disallow: /source/browse/helm/depend.txt # excludes the specified file.
Disallow: index.html # excludes all the index.html files.

Sample 2:

The following robots.txt file excludes the whole site from being indexed by the robot CEE:
User-agent: CEE
Disallow: /

Sample 3:

The following robots.txt file allows the robot named CEE to index all the data:
User-agent: CEE
Disallow:

Sample 4:

The following robots.txt file specifies that no robots should visit any URL containing the pattern /source/, except the robot CEE.
User-agent: *
Disallow: /source/
User-agent: CEE
Disallow:
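
Record selection for a given robot, as in Sample 4, might be sketched in Python as follows. The dictionary layout and the name select_record are assumptions made for illustration; treating a missing record as "no exclusions" is also an assumption.

```python
def select_record(records, robot_name):
    """Pick the Disallow values that apply to robot_name.

    records maps a user-agent name to its list of Disallow values.
    """
    # User-agent names are case-insensitive.
    by_name = {name.lower(): rules for name, rules in records.items()}
    wanted = robot_name.lower()
    if wanted in by_name:
        return by_name[wanted]
    # Otherwise fall back to the default "*" record, if there is one.
    return by_name.get("*", [])


sample_4 = {"*": ["/source/"], "CEE": [""]}
cee_rules = select_record(sample_4, "cee")
other_rules = select_record(sample_4, "OtherBot")
```

With Sample 4, the robot CEE gets its own record with an empty Disallow value, while any other robot falls back to the "*" record and is kept away from /source/.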

Location and Overriding

To override the settings made at the domain level, add a robots.txt file at the project level, in the www area of the project's version control repository.

These project-level robots.txt files are accessible at the URL http[s]://projectname.domainname/robots.txt. As the project owner, you may want to exclude some artifacts in the project from being indexed. In that case, use the robots.txt file in the project to exclude the artifacts, rather than the global robots.txt file located at the domain level.

Consider the following robots.txt file at the domain level.

User-agent:*
Disallow:/source/
User-agent: CEE
Disallow: /issues/
Disallow: /servlets/

The following robots.txt file is committed to the www area of the version control in the project "ProjectSample".

User-agent: CEE
Disallow: /search/
Disallow: /servlets/

In the above example, exclusion of artifacts in the project "ProjectSample" is based only on the robots.txt file at the project level; the file at the domain level has no effect. That is, only URLs containing /search/ or /servlets/ are excluded. Artifacts whose URLs contain /issues/ are indexed, even though /issues/ is listed in the domain-level file.
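
The effect described above can be sketched in Python, assuming the project-level file replaces the domain-level file as a whole. The dictionary layout and the name effective_file are hypothetical.

```python
def effective_file(domain_file, project_file=None):
    """Return the robots.txt records that apply to a project."""
    # Assumption: a project-level file, when present, replaces the
    # domain-level file entirely; the two are not merged.
    return project_file if project_file is not None else domain_file


domain_file = {"*": ["/source/"], "CEE": ["/issues/", "/servlets/"]}
project_sample_file = {"CEE": ["/search/", "/servlets/"]}

applied = effective_file(domain_file, project_sample_file)
cee_rules = applied.get("CEE", [])
```

For "ProjectSample", cee_rules contains /search/ and /servlets/ but not /issues/. The applied file also has no "*" record, because the project-level file did not define one.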

The robots.txt file at the domain level can be overridden by a robots.txt file at the project/projectgroup/category level. However, the robots.txt file at the project level cannot be overridden at the level of its subprojects.

Note: When a robots.txt file is overridden, the User-agent:* record from the domain level must be included at the project level as well; otherwise there will be no entry for other robots. If there is no User-agent:* record at the domain level, the project owner can decide whether to have a User-agent:* record at the project level.

Rules for Overriding the robots.txt file

Overriding applies at the record level, not at the level of individual Disallow entries.

The default robots.txt at the domain level excludes all the URL patterns containing /source/, /search/, /issues/ and /servlets/. But for the CEE internal indexer, you may not want to exclude these by default. This can be done by having a separate record for the CEE internal indexer. The following robots.txt file should be used as the default file:

User-agent: *
Disallow: /source/
Disallow: /search/
Disallow: /issues/
Disallow: /servlets/
User-agent: CEE
Disallow:

Be careful to include a record for the internal domain-wide indexer in every robots.txt file used. Otherwise the results will be unpredictable. For example, suppose the following robots.txt file, which has no record for CEE, is committed to a project.

User-agent: *
Disallow: /source/
Disallow: /search/
Disallow: /issues/
Disallow: /servlets/

The URL to access an issue in a project is of the form http[s]://project.domainname/issues/show_bug.cgi?id=xxxx.

The Disallow: /issues/ entry above causes all Issue Tracker issues to be excluded from indexing, because their URLs contain /issues/. Similarly, the URLs used to access email messages, users, and so on contain the pattern /servlets/. To avoid these problems, include the following record in the robots.txt file.

User-agent: CEE
Disallow:

The above entry will not exclude any artifacts from being indexed by the internal domain-wide indexer.
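
A small check along these lines could flag a robots.txt file that lacks a CEE record before it is committed. This is only a sketch in Python; missing_agents is a hypothetical helper and is not part of the product.

```python
def missing_agents(robots_txt, required=("CEE",)):
    """Return the required user-agent names that have no record in the file."""
    declared = set()
    for raw_line in robots_txt.splitlines():
        # Drop comments, then look only at User-agent lines.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            declared.add(value.strip().lower())
    # User-agent names are compared case-insensitively.
    return [name for name in required if name.lower() not in declared]


project_file = """\
User-agent: *
Disallow: /source/
Disallow: /search/
Disallow: /issues/
Disallow: /servlets/
"""
before = missing_agents(project_file)
after = missing_agents(project_file + "User-agent: CEE\nDisallow:\n")
```

For the project file shown earlier, before lists CEE as missing; once the CEE record is appended, after is empty.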

Note: To use the robots exclusion feature, you need to know the URL namespace.

To obtain the URL of a particular artifact type, issue type, or document, open it and note the URL displayed at the top of the page.

For example, to identify the URL for a particular issue:

  1. Log in to CollabNet.
  2. Select a project that has Issue Tracker as its tracking tool.
  3. Click the Issue Tracker link.
  4. Enter an issue number, for example 21443.
  5. Note the URL displayed, which will be similar to https://domainname/issues/show_bug.cgi?id=21443.
  6. If you use "/issues/show_bug.cgi?id=21443" as the Disallow entry, issue 21443 will be excluded from being indexed.

Similarly, if you want to identify the URL for a particular folder:

  1. Log in to CollabNet.
  2. Select a project.
  3. Click the Documents and Files link.
  4. Create a folder.
  5. Select that folder you created.
  6. Note the URL displayed, which will be similar to http[s]://[domainname]/servlets/ProjectDocumentList?folderID=1.
  7. If you use "/servlets/ProjectDocumentList?folderID=1" as the Disallow entry, the folder whose folder ID is 1 will be excluded from being indexed. However, this does not exclude the files under the folder. To exclude individual files or documents, identify their URLs and add a Disallow entry for each.

Note: You can specify a partial URL such as "index" without a preceding slash, but do so with care. If the Disallow entry has "index" as the URL pattern, all of the following are excluded from being indexed: /specs/indexing/robots.txt, /index.html, /specs/index.html. To avoid confusion, specify the complete URL.

Note: The robots.txt file is optional; it is acceptable for a site to have no robots.txt file at all.

Some important points to remember

1. Deciding where to add the Disallow entry.

You can add the entry to the robots.txt file at the domain level (www project) or to the one at the project level. To apply the rule to all artifacts in all projects, make the entry in the global robots.txt file. To apply the rule only to the artifacts within one project, add the entry to that project's robots.txt file.

Note: Users belong to a domain, so to prevent a user from being indexed, make the entry in the global robots.txt file in the www project. Similarly, to prevent certain Help documents from being indexed, make the entry in the global robots.txt file in the www project.

2. Preventing a project from being indexed.

Whether this is possible depends on whether the project's URL has a resource portion. For example, the URL to access the www project is http[s]://www.domainname/servlets/ProjectHome. If a Disallow entry for the URL pattern "/servlets/ProjectHome" is added to the robots.txt file of the www project, in the record for the internal domain-wide indexer, the project will not be indexed.

If virtual hosting is enabled, the project URL has no resource portion, and the robots.txt file cannot be used to prevent the project from being indexed. If virtual hosting is not enabled, the URL to access the project is of the form http://domain/...?projectID=$ID, which does have a resource portion.

To exclude projects, the Disallow entry can be placed either in the default project "www" or at the project level. If it is added to the robots.txt file committed to the project "www", it applies to all projects.

Note: Excluding a project does not automatically exclude the artifacts in the project. Artifacts must be excluded individually.

3. Preventing deleted news items from being displayed in the Search results page even after incremental indexing.

A deleted artifact is removed from the index only during the fullIndexRebuild, which usually happens every 7 days. Robots exclusion can be used to remove a deleted news item from the index at the next incremental indexing instead. For example, if the news item ID is 45, add the following Disallow entry for the CEE user agent to the robots.txt file committed to the project where the news item was created. The item is then removed from the index, and further searches do not return it on the Search results page.

Disallow: /servlets/NewsItemView?newsItemID=45

Examples of creating a robots.txt file

Example 1

To prevent a user from being indexed, add the following Disallow entry in the robots.txt file committed to the default project "www" in the record for the internal domain-wide indexer CEE.

Disallow: /servlets/UserEdit?userID=xxxx

The User ID of the user can be identified from the UserEdit page by clicking on the username. The URL will contain the User ID.

Example 2

To exclude the commits@ProjectSample.domainname discussion, and all the mail sent to the list, from being indexed by the internal domain-wide indexer, add the following Disallow entries under User-agent: CEE in the robots.txt file of the project "ProjectSample".

User-agent: CEE
Disallow: /servlets/SummarizeList?listName=commits # this excludes the list.
Disallow: /servlets/ReadMsg?listName=commits # this excludes all the messages sent to the list.
Disallow: /servlets/GetAttachment?list=commits # this excludes all the attachments.

Note: If the above entries are added to the global robots.txt file, the commits list and its messages will not be indexed in any project.

Limitations

Adding a default robots.txt file during project creation

To add a default robots.txt file to the version control area of each project during project creation, follow the steps below.

  1. Create a template for your project.
  2. Create a robots.txt file with the required contents and add it to that template.
  3. Select this template during project creation.
  4. All files included in the template, including the robots.txt file, are automatically created in the project's version control repository.

Note: For more information, refer to the Content Repository Templating Help page.

 
