FAQ

Indexing

Some characters (ä,ö,ü,à,â) aren't displayed properly

  1. Check that your browser supports UTF-8 encoding. Internet Explorer,
    Firefox, Mozilla/Netscape, Safari and Opera do, though some have problems. Go to
    http://www.macchiato.com/unicode/Unicode_transcriptions.html to test it.
  2. If you are running the Apache web server, check the value of 'AddDefaultCharset' in your httpd.conf. It must be set to 'utf-8'. See
    http://httpd.apache.org/docs-2.0/mod/core.html#adddefaultcharset for details.
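To see which charset a server actually announces, you can look at the charset parameter of the Content-Type response header. The following Python sketch only parses such a header value. The function names charset_of and announces_utf8 are made up for this illustration and are not part of TSEP.

```python
from email.message import Message
from typing import Optional


def charset_of(content_type: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def announces_utf8(content_type: str) -> bool:
    """True if the header value declares UTF-8 (hypothetical helper, not TSEP code)."""
    return charset_of(content_type) == "utf-8"
```

The header value itself can be read, for example, with urllib.request.urlopen(url).headers.get("Content-Type") and then passed to announces_utf8.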

How to index dynamic sites

In general you will need at least version 0.940 for this.

Use the "Force parsing via HTTP" option described in the quick start. (Read Indexing your Site -> Notes -> "Force parsing via HTTP", and see "When trying to index a (php) script the contents, not the result of the script is indexed - why?" below.)

A specific problem:

Question:

I have a site where I have information in a database. The user gets shown a list, where the links are pagename.php?id=n.

I'm hoping to write a script that will feed data to TSEP, where I concatenate all of the data fields and return them as the text, and construct the full url including the id.

My question is will the full url (with parameters) with the id be preserved by tsep, or will it only store the pagename?

So this user wants to feed content to TSEP and also have TSEP store the full URL, including ?id=n, as the page address.

Our answer was:

You will need at least version 0.940 to do this!

fillwithcontent.php (/admin/examples/fillwithcontent.php) is the key to this question. It will deliver filenames and content - phpcrawl4tsep.php would deliver only filenames.

But the user was still having problems:

Tried it, and I seem to be running into a problem with the 'fileextensions to be included' list. It thinks my files have an extension of '.php?id=6'. Here are the errors I get:

2 pages NOT to be indexed (type: ExtDisAllowed)
type directory/filename filter
ExtDisAllowed http://www.website.com//FAKE.txt \.(htm|html|php)$
ExtDisAllowed http://www.website.com//View.php?id=6 \.(htm|html|php)$

The first entry is a test. When I add txt to the fileextensions, both of these errors go away - the call to add the FAKE.txt works, but the calls to add 'file.php?id=x' do not cause any visible error and do not give any indexed entries.

This was answered as follows. Pay close attention to the regular expression used, since you might also use it for other purposes:

Use as 'fileextensions to be included':

htm,html,php,txt[^ ]*

The fileextension-definition is a comma-separated list of regular expressions. Each comma is replaced by a pipe sign and the complete string is embedded in "\.(" and ")$".
Example: "htm,php" becomes "\.(htm|php)$"

Note: Dots are removed from the string given as the fileextension-definition.
Therefore you cannot define "txt.*" for your request; this would result in "txt*".
If you use "txt[^ ]*" instead, it works fine.
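The rule described above can be sketched in a few lines of Python. This is only an illustration of the description, not TSEP's actual code: the function names are hypothetical, and the escaped dot in the prefix is taken from the filter shown in the error listing. The entry "php[^ ]*" used in the test URL check is our own extrapolation of the "txt[^ ]*" pattern to .php addresses with a query string.

```python
import re


def build_extension_regex(definition: str) -> str:
    """Turn a comma-separated extension definition into the filter regex."""
    cleaned = definition.replace(".", "")      # dots are removed first
    alternatives = cleaned.replace(",", "|")   # each comma becomes a pipe
    return r"\.(" + alternatives + r")$"       # wrapped as in the error listing


def is_allowed(url: str, definition: str) -> bool:
    """True if the URL passes the extension filter built from the definition."""
    return re.search(build_extension_regex(definition), url) is not None


print(build_extension_regex("htm,php"))   # the example from the text
print(build_extension_regex("txt.*"))     # shows why "txt.*" cannot be used
print(is_allowed("http://www.website.com//View.php?id=6", "htm,html,php"))
print(is_allowed("http://www.website.com//View.php?id=6", "htm,html,php[^ ]*"))
```

Because the pattern is anchored with "$", an address such as View.php?id=6 only passes if the entry for php also allows the trailing "?id=6", which is what the "[^ ]*" suffix does.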

Related topics:

When trying to index a (php) script the contents, not the result of the script is indexed - why?

Question:

I'm confused about indexing a site. I instructed TSEP to index .php files (via the "Fileextensions to be included" parameter) expecting it to index the result of the php-script, not the content. For your information: I let it index a Mambo driven CMS site. Is this the designed behaviour or am I missing something?

Answer:

If you want the result of the php (or any other) script, use the force-http option in the indexer ("Force parsing via HTTP", introduced in version 0.940).
Otherwise TSEP indexes the contents of the script file itself.

Related topics:

Scheduling: cron / at

Please see "Scheduling: cron / at" in Advanced Configuration for this extensive topic.

How can I change the filetypes TSEP indexes?

This has been changed in 0.934 - now you can simply enter the filetypes (extensions) you want TSEP to index on the indexer page. Please separate different types by comma only (no spaces etc). Also pay attention to the case of the extensions: "php" is not the same as "PHP" on Unix/Linux systems!

Example: html,htm,php

Searching

What does the "rank" of the pages mean?

Rank means that the pages are shown ordered by the number of hits they received for all search words. Example: if a search returns 2 results, the search words were found more often on the page with rank 1 than on the page with rank 2. This is simple, but very useful if your site has many pages and the user might face lots of results.
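The ordering described above can be illustrated with a small Python sketch. It is not TSEP's implementation (TSEP searches via MySQL); the function names and the sample pages are invented for the example, and hits are simply counted as whole-word matches.

```python
import re
from typing import Dict, List, Tuple


def count_hits(text: str, search_words: List[str]) -> int:
    """Count how often any of the search words occurs in the text."""
    wanted = {word.lower() for word in search_words}
    return sum(1 for word in re.findall(r"\w+", text.lower()) if word in wanted)


def rank_pages(pages: Dict[str, str], search_words: List[str]) -> List[Tuple[int, str, int]]:
    """Return (rank, url, hits) tuples, most hits first; pages without hits are dropped."""
    found = [(url, count_hits(text, search_words)) for url, text in pages.items()]
    found = [(url, hits) for url, hits in found if hits > 0]
    found.sort(key=lambda item: item[1], reverse=True)
    return [(rank, url, hits) for rank, (url, hits) in enumerate(found, start=1)]


# Hypothetical sample data
pages = {
    "a.html": "search engine search",
    "b.html": "a search page",
    "c.html": "nothing here",
}
for rank, url, hits in rank_pages(pages, ["search"]):
    print(rank, url, hits)
```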

Other

How can I change the look of TSEP to fit it best into my own layout?

This is simple but takes a little while. To make things as easy as we can, we will look at the result page step by step. The formatting we show you here is from version 0.911. It might change in the future, but it should still be pretty much the same.

Please note that there are additional div blocks in the search page. Those are only shown when errors occur (a stopword was searched, the MySQL version is too low...). For now we leave it up to you to look more deeply into that formatting; for the general user's sake we stick with what most people will see.

If you have done some nice formatting, we would appreciate it if you could contact us and send us your CSS file so that we could include it in a new TSEP version.

All of TSEP, on all TSEP pages, is wrapped in a single div container that provides a global area for TSEP.

With this knowledge alone you can already change the look considerably, for example by setting the .tsepProject class in the tsep.css file to another font. This will change all fonts in the TSEP area to whatever you define.

Now that you know the header, let's look at the next part of the search page: the .SearchBlock, which contains the search form fields and the help - which, as you can see, has its own div container .SearchHintsHelp.

(Screenshot: search block with div tags)

This SearchBlock is followed by another .SearchBlock which provides status information. This whole block is repeated at the bottom of all search results. If you know a little about CSS you should be able to format this block to fit your needs.

(Screenshot: search status output with div tags)

This first container of this type is followed by our search results. Here we use the following classes:

.SearchResultAllPagesBlock - this is the block of all the results.

.SearchResultOnePageBlock - this is a block of one resulting page.

.SearchResultOnePageTitle - this is the title of the webpage we found in the database.

.resultnumber - this is the rank of the page. (details: rank).

.SearchResultPageRank - displays how many times the page had a hit from the searchwords.

.SearchResultOutput - these are the words which we indexed - until we encounter the first "explode" character (a . (dot) right now).

.foundSearchWord - this is one of the words the user has searched. It can be highlighted so that the user spots it faster.

.SearchResultOutputMore - these are the little dots which show the user there is more on the page.

.SearchResultURL - is the URL of the page we have found, extended by the size of the page (as written in the database).

(Screenshot: search results and div tags used)

Creating a new language

Please note that since version 0.940 we have changed the way of translation.

If you want to translate TSEP to a new language or you want to help with translation (for example when updating a language is required), please contact us. We will arrange that you will be listed as translator for a language.

The translation itself is done online. This ensures that several translators can work together on a single translation.

What are the index.htm files (size 0k) for, anyway?

The index.htm files you find in some directories are for security only. If directory listing is enabled on the web server and there were no index.htm, a user could see the complete contents of the directory. With the (empty) index.htm in place, the user sees nothing in the browser when trying to access the directory directly.

Information

TSEP code documentation

TSEP has been documented with phpxref. You can download this from Sourceforge as well: http://sourceforge.net/projects/phpxref/

What version am I running?

The version of TSEP is included in the 'title' tag of the copyright notice. This means that you can move your cursor over the copyright notice (on the bottom of the search page for example) and after a little while your browser should display the version number.

The version number is read from a textfile in the include directory named tsepversion.txt. There is no need to change anything in this file: It is maintained by the programmers.

How do I get information about my server environment? (PHP Info)

It has come to our attention that users who are new to PHP in particular might have problems getting information about their server. This information might be needed, though, if there are any problems.

For this reason we include a file called tsepinfo.php in the admin directory. Assuming you have installed TSEP in www.yourdomain.com/tsep simply point your browser to

http://www.yourdomain.com/tsep/admin/tsepinfo.php

to receive information about TSEP and your server.

Restrictions

What files can TSEP index?

TSEP can index text files (ASCII, UTF-8) only. Text files are usually (examples) TXT, ASC, NFO, HTM, HTML, PHP, PHP3

You cannot index any binary files! Binary files are (examples) ZIP, PDF, DOC, XLS, EXE, GIF, JPG, JPEG, PNG
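If you are unsure whether a given file counts as text in this sense, a rough check is to test whether its bytes decode as UTF-8 (ASCII is a subset of UTF-8) and contain no NUL bytes. The following Python sketch is only such a heuristic. It is not how TSEP decides what to index (TSEP goes by the extension list), and the function name is made up for the example.

```python
def looks_like_indexable_text(data: bytes) -> bool:
    """Heuristic: True if the bytes are valid UTF-8 and contain no NUL byte."""
    if b"\x00" in data:
        return False            # NUL bytes are typical for binary formats
    try:
        data.decode("utf-8")    # plain ASCII also decodes as UTF-8
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_indexable_text("<html>Grüße</html>".encode("utf-8")))
print(looks_like_indexable_text(b"\x89PNG\r\n\x1a\n"))  # start of a PNG file
```

To check a real file, read it in binary mode, for example with open(path, "rb"), and pass the bytes to the function.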

MySQL restrictions

  1. When you want to order the results in your logview.php by IP address, MySQL v3.23 or higher is needed.
  2. UTF-8 handling:
    TSEP (>=0.940) uses Unicode (UTF-8). MySQL versions before 4.1 do not calculate the length of 'special' Unicode characters (e.g. é, Â, ...) correctly - we have created a workaround for this though!
    MySQL does (in principle) not find words with a length < 4. Words containing such 'special' Unicode characters may not be found, because the word length is computed incorrectly.
    You might want to read this page for details of UTF-8 handling in MySQL: http://www.akadia.com/services/mysql_survival.html
  3. There are certain MySQL restrictions to a full text search:

Quote:

Any word that is too short is ignored. The default minimum length of words that will be found by full-text searches is four characters.

But: There is a workaround! While the search for "a" (without the quotes) will not work because the length is only 1 character, the search for "a***" (without the quotes) will work! This should return all pages where the letter "a" shows up.
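The interplay between the minimum word length and the UTF-8 note above can be illustrated as follows. The Python sketch only shows that the number of characters and the number of UTF-8 bytes of a word differ as soon as 'special' characters are involved, so a length check gives different answers depending on what is counted. It is an illustration of the general effect, not a description of MySQL internals; the names are invented for the example.

```python
MIN_WORD_LENGTH = 4  # default minimum from the quote above


def char_length(word: str) -> int:
    """Length in characters."""
    return len(word)


def byte_length(word: str) -> int:
    """Length in bytes once the word is encoded as UTF-8."""
    return len(word.encode("utf-8"))


def long_enough(length: int) -> bool:
    """True if a word of this length reaches the minimum word length."""
    return length >= MIN_WORD_LENGTH


word = "\u00e9t\u00e9"  # the French word for summer, with two accented letters
print(char_length(word), long_enough(char_length(word)))
print(byte_length(word), long_enough(byte_length(word)))
```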

Quote:

Words in the stopword list are ignored. A stopword is a word such as "the" or "some" that is so common that it is considered to have zero semantic value. There is a built-in stopword list.

There is also a 50% threshold. This means that if a (searched) word occurs in at least 50% of the rows searched, it will not be returned by MySQL. This applies to full-text search only, and TSEP uses full-text search! To get around this behaviour, which is explained in the MySQL manual section 13.6 Full-Text Search Functions, we recommend that you put those words into your stopword list. This will at least show the user that the word has not been searched for.
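To find candidates for your stopword list, you could count in how many rows each word occurs and list the words that reach the threshold. The Python sketch below does this for a list of plain strings. It is a simplified model under our own assumptions (one string per row, simple word splitting, invented function name and sample data) and does not reproduce MySQL's exact word handling.

```python
import re
from typing import Dict, List


def words_over_threshold(rows: List[str], threshold: float = 0.5) -> List[str]:
    """Return the words that occur in at least `threshold` of all rows."""
    rows_containing: Dict[str, int] = {}
    for row in rows:
        # set(): a word counts once per row, however often it occurs in it
        for word in set(re.findall(r"\w+", row.lower())):
            rows_containing[word] = rows_containing.get(word, 0) + 1
    return sorted(
        word for word, count in rows_containing.items()
        if count / len(rows) >= threshold
    )


# Hypothetical sample rows
rows = ["tsep search engine", "tsep indexer", "tsep admin search", "cron job"]
print(words_over_threshold(rows))
```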

For more details, read the source page of these quotes: 13.6 Full-Text Search Functions

The restrictions are covered in 13.6.3 Full-Text Restrictions

People with access to the MySQL server, though, can fine-tune their MySQL to overcome these restrictions. You will find information about this in 13.6.4 Fine-Tuning MySQL Full-Text Search

You will find more on the built-in MySQL stopwords when you search the MySQL page for "stopword list". A list of the words which we think are compiled into MySQL is in the docs directory: stopword-mysql.txt

Personally I do not see a big problem with the built-in stopwords, because they are so general that probably no one who is really trying to find something will enter "you" as a search word. Searching is nothing new to people, so they will enter the words they think match best what they need. This also means that they will enter words which are probably long enough not to fall under the length restriction. Also, those are English words, and TSEP is now ready for other languages as well. (Olaf)