Advanced Configuration

Using external data supply

What's meant by external Datasupply?

This feature was introduced in version 0.927. To build the index, the indexer needs a way to find the files to be indexed. TSEP's integrated file-find algorithm reads each directory and subdirectory, beginning at the given start directory, and collects the filenames found there for indexing.
In addition to this integrated algorithm, TSEP can build the index for files whose filenames (URLs) are supplied from outside TSEP (e.g. a simple file or URL list, or a file list returned by a crawler/spider process).

Also see:

How to use external Datasupply:

How to send data to TSEP

The external datasupply must be a PHP script that communicates with TSEP as follows:
On the TSEP admin page "build-new-index", enter the (fully qualified) name of this PHP script.
URLs and messages are returned to TSEP via

call_user_func("TSEP_ExternalCallBack", $returnstring);

$returnstring must be one of the following:

"URL>an_url"
an_url will be indexed by TSEP.

"ERR>an_errormessage"
an_errormessage will be echoed to the browser as an error.

"INF>an_infomessage"
an_infomessage will be echoed to the browser as information.

"ALL>an_url<tsepcontent>content_of_the_file"
an_url will be indexed by TSEP, but TSEP does not read the file. The file content is taken from content_of_the_file.
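
As a minimal sketch of the message formats listed above, the following script sends one message of each kind. The helper sendToTsep(), the example.com URLs and the $tsepDemoLog array are assumptions made for illustration only. The stand-in callback exists just so the sketch can be run outside TSEP; it is skipped when TSEP has already defined TSEP_ExternalCallBack.

```php
<?php
// Stand-in for testing outside TSEP (assumption, not part of TSEP).
// Inside TSEP the real callback already exists and this block is skipped.
if (!function_exists('TSEP_ExternalCallBack')) {
    function TSEP_ExternalCallBack($returnstring)
    {
        $GLOBALS['tsepDemoLog'][] = $returnstring; // remember what was sent
        echo $returnstring, "\n";
    }
}

// Hypothetical helper: builds "TYPE>payload" and hands it to the callback.
function sendToTsep(string $type, string $payload): void
{
    call_user_func('TSEP_ExternalCallBack', $type . '>' . $payload);
}

// Hypothetical URL list; a real script might read it from a file or a crawler.
$urls = [
    'http://www.example.com/index.html',
    'http://www.example.com/about.html',
];

sendToTsep('INF', 'datasupply started');

if (count($urls) === 0) {
    sendToTsep('ERR', 'no URLs to supply');
}

foreach ($urls as $url) {
    sendToTsep('URL', $url); // TSEP reads and indexes the file itself
}

// Supply the content directly, so that TSEP does not read the file.
sendToTsep('ALL', 'http://www.example.com/virtual.html<tsepcontent>Text to be indexed');

sendToTsep('INF', 'datasupply finished');
```

Wrapping the call in one small helper keeps the message prefixes in a single place, which may make typos such as a missing ">" less likely.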

The admin/examples directory contains examples of how this external datasupply feature can be used. Try "examples/urllist.php" as the external datasupply for your first tests: it is a very simple datasupply and makes it easy to understand how this feature works.

examples/phpcrawl4tsep.php is a working sample script that communicates with an installed PHPCrawl. Place phpcrawl4tsep.php in the same directory as PHPCrawl (where PHPCrawl's example.php is located) and call it from the TSEP indexer.

Attention:

Example of how to configure TSEP correctly to run phpcrawl4tsep.php:
Assume:
www.mydomain.de/index.php                         entry page of your site
www.mydomain.de/php/tsepsearch                    installation directory of TSEP
www.mydomain.de/php/tsepsearch/admin/indexer.php  TSEP indexer script ("build-new-index"-startpage)
www.mydomain.de/php/phpcrawl/                     installation directory of PHPCrawl
www.mydomain.de/php/phpcrawl/phpcrawl4tsep.php    our sample script

The picture of the installation shows our example with its settings.

Definitions, made available by TSEP

Parameters entered on the TSEP admin page "build-new-index" are made available to the external datasupply script via public variables:

$TSEPdirname
Directory given as first entry in the "build-new-index"-startpage
$TSEPwebdir
WebDirectory given as first entry in the "build-new-index"-startpage
$TSEPdirexclude
directory-excludes given in the "build-new-index"-startpage
$TSEPfileexclude
file-excludes given in the "build-new-index"-startpage
$TSEPlistFilenamesOnly
"1", if the '"would-be-indexed"-filelist only'-checkbox is checked
$TSEPparmsexternalphp
This is the value entered on the "build-new-index"-startpage in the field "Enter parameter to be sent to datasupply-script". It can be any string that the external datasupply script needs.
For example, if the external datasupply script is a crawler, it normally needs an entry HTML filename where the search has to start. This filename can be passed to the script via the field "Enter parameter to be sent to datasupply-script" on TSEP's "build-new-index"-startpage, and the script can read it from the variable $TSEPparmsexternalphp.
$TSEPextinclude
This is the value entered on the "build-new-index"-startpage in the field "Fileextensions to be included". In $TSEPextinclude, whitespace is removed and the extension list is pipe-separated (e.g. "htm|html|php").
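
The variables above could be used in a datasupply script roughly as follows. This is only a sketch: it assumes the variables are visible in the script's scope, the fallback values and the example.com URLs are hypothetical, and extensionIncluded() is a helper invented for illustration, not a TSEP function.

```php
<?php
// Inside TSEP these variables are set by the indexer. The fallbacks are
// hypothetical values so that the sketch also runs standalone.
$TSEPextinclude       = $TSEPextinclude ?? 'htm|html|php';
$TSEPparmsexternalphp = $TSEPparmsexternalphp ?? 'http://www.example.com/index.html';

// Hypothetical helper: true if the URL's file extension appears in the
// pipe-separated extension list (compared case-insensitively).
function extensionIncluded(string $url, string $extInclude): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // URL without a path, or malformed URL
    }
    $ext     = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $allowed = array_map('strtolower', explode('|', $extInclude));
    return $ext !== '' && in_array($ext, $allowed, true);
}

// Start with the entry page passed via the parameter field, plus one
// hypothetical candidate that should be filtered out.
$candidates = [$TSEPparmsexternalphp, 'http://www.example.com/logo.png'];

foreach ($candidates as $url) {
    if (extensionIncluded($url, $TSEPextinclude)) {
        // A real script would pass this string to TSEP_ExternalCallBack.
        echo 'URL>' . $url, "\n";
    }
}
```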

TSEP Tags for your code

In version 0.938 we introduced the first tags.

You can add these tags to the pages to be indexed in order to give TSEP instructions.

At this time there are 3 different tags:

<!-- tsep:cmd:start/ -->
Ignore (do not index) everything before this tag.
<!-- tsep:cmd:end/ -->
Ignore (do not index) everything after this tag.
<!-- tsep:cmd:noindex --> and <!-- /tsep:cmd:noindex -->
Ignore everything in between those two tags (the word "and" in this case).
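
To illustrate the effect of the three tags, here is a small sketch that mimics the behaviour described above on a sample string. It is not TSEP's own parser: applyTsepTags() is a hypothetical helper, and the order in which the tags are evaluated is an assumption.

```php
<?php
// Illustration only: mimics the tag semantics described in this section.
function applyTsepTags(string $html): string
{
    $start = '<!-- tsep:cmd:start/ -->';
    $end   = '<!-- tsep:cmd:end/ -->';

    // Drop everything up to and including the start tag, if present.
    $pos = strpos($html, $start);
    if ($pos !== false) {
        $html = substr($html, $pos + strlen($start));
    }

    // Drop everything from the end tag onwards, if present.
    $pos = strpos($html, $end);
    if ($pos !== false) {
        $html = substr($html, 0, $pos);
    }

    // Remove each noindex block, including the enclosing tags.
    $pattern = '~<!-- tsep:cmd:noindex -->.*?<!-- /tsep:cmd:noindex -->~s';
    return preg_replace($pattern, '', $html) ?? $html;
}

$page = 'menu<!-- tsep:cmd:start/ -->Hello '
      . '<!-- tsep:cmd:noindex -->ads<!-- /tsep:cmd:noindex -->'
      . 'world<!-- tsep:cmd:end/ -->footer';

echo applyTsepTags($page), "\n"; // prints "Hello world"
```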

Scheduling: cron / at

TSEPautoIndexing.sh should be placed in the admin/examples directory. This shell script is intended to be called by cron (Linux). The equivalent for Windows systems is a new script, TSEPautoIndexing.cmd. Some detailed instructions:

How to initiate indexing via unix-command curl (intended to be combined with cron):

launching indexer using current IndexingProfile:
curl http://.../admin/indexer.php -d startindexing=startindexing -o <out.htm>
launching indexer using specific IndexingProfile:
curl http://.../admin/indexer.php -d startindexing=startindexing -d profile=<name-of-profile> -o <out.htm>
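
If the curl call is to be generated by a script, the two variants above could be assembled as in the following sketch. buildIndexerCommand() is a hypothetical helper, and the host, output filename and profile name are placeholders, not values taken from a real installation; only the curl options -d and -o shown above are used.

```php
<?php
// Hypothetical helper: assembles the curl call described above.
// Without a profile name the current IndexingProfile is used.
function buildIndexerCommand(string $indexerUrl, string $outFile, ?string $profile = null): string
{
    $parts = ['curl', escapeshellarg($indexerUrl), '-d', 'startindexing=startindexing'];

    if ($profile !== null && $profile !== '') {
        $parts[] = '-d';
        $parts[] = escapeshellarg('profile=' . $profile);
    }

    $parts[] = '-o';
    $parts[] = escapeshellarg($outFile);

    return implode(' ', $parts);
}

// Placeholder values; replace them with those of your own installation.
$indexer = 'http://www.example.com/tsep/admin/indexer.php';

echo buildIndexerCommand($indexer, 'out.htm'), "\n";              // current profile
echo buildIndexerCommand($indexer, 'out.htm', 'myprofile'), "\n"; // specific profile
```

Quoting each argument with escapeshellarg() is a precaution for profile names or paths that contain spaces or shell metacharacters.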

Examples:

or

Important Notes:

Hint:

Use the shell-script admin/examples/TSEPautoIndexing.sh

Please adjust the two variables within that script to your needs first (see there). The script can be called without a parameter to launch the indexer using the current IndexingProfile. Examples:

or

Simply use the name of the IndexingProfile to be indexed as the parameter:

This script runs the indexing process and stores the resulting HTML output file in the TSEP subdirectory "bgindexing.log", using a filename that contains the current date/time and the name of the indexing profile. You can later open the desired file in your favorite browser to check the results.

ContentImages

In general

Usually, search results are shown as text: the page title, part of the content, and the link to the page. ContentImages can be shown in addition. What we call ContentImages are images of your web pages: more or less tiny screenshots. You may know such thumbnails from, for example, Thumbshots.org ( http://www.thumbshots.org/ ).

ToDo:

After deleting an image or a ContentImage File List, or after uploading an image, you currently have to refresh the window manually (F5). We are working on a solution for this "user unfriendliness".

Configure ContentImages

Use ContentImages
Switches the use of ContentImages in your TSEP installation on or off.
Images-Path for Web-Access
Path where the ContentImages are located (used by the HTML img tag to show the images).
Images-Path for PHPscript-Access
Path where the ContentImages are located (used for file access by the PHP scripts).
Root path for ContentImage File Lists for Web-Access
Path where the ContentImage File Lists are located (used by the HTML within "Configure/Manage ContentImages").
Root path for ContentImage File Lists for PHPscript-Access
Path where the ContentImage File Lists are located (used for file access by the PHP scripts).
Image-Filename-Extension
FileExtension to be used for ContentImages: preferably use ".jpg" or ".png"
Default image
Filename (name only, no path and no extension!). You may upload the default image via the button on the right side, but first you have to enter the name of the file (it does not have to match the name of the file on your PC that you are uploading). To upload a file, all paths (above) have to be defined AND saved (via the "update values above" button).
maximal display-height
Maximal height of the image to be shown on the result pages (the aspect ratio is kept in conjunction with the "maximal display-width").
maximal display-width
Maximal width of the image to be shown on the result pages (the aspect ratio is kept in conjunction with the "maximal display-height").
The indexer should create ContentImage File Lists
If the indexer is run and this option is switched on, a ContentImage File List (associated to the indexing profile) is created.
Only for pages having no ContentImage
If "The indexer should create ContentImage File Lists" is switched on, a file list entry is written to the ContentImage File List only if no ContentImage exists for the page.
Automatically run transformation
Transformation is run automatically after the indexer.
Transformations
ContentImage File List entries can be transformed, using a transformation template, into .bat files, shell scripts, etc. This output can be used, for example, to run an external program that builds the screenshots, or to upload the screenshots.

There are three template examples delivered with TSEP (located in
<tsepinstalldir>/contentimages/filelists/transformation_templates).

1. toWebswoon.bat
Creates a .bat file that runs Webswoon (http://www.intellitamper.com/webswoon/) to create a screenshot of each page.

2. WebswoonCopy2Host.bat
Creates a .bat file that copies the created screenshots (Webswoon results) into the directory where the ContentImages reside ("Images-Path for PHPscript-Access").

3. WebswoonFtp2Host.bat
Creates a .bat file that uploads the created screenshots (Webswoon results) into the directory where the ContentImages reside ("Images-Path for PHPscript-Access").

These templates are examples, which have to be adjusted to your needs before use. They are intended for use with Webswoon and are designed to be used under Windows.

Please have a look into the directory <tsepinstalldir>/admin/examples, where you can find two shell scripts that can be used to create screenshots under *nix systems: wwws.sh and wwwshot.sh.

We will add an example template for creating wwws.sh in the future.
TSEP currently supports up to two transformations.
Templatefilename
Filename+Extension. Do not enter a path.
Currently, template files have to be located under <Root path for ContentImage File Lists for PHPscript-Access>/transformation_templates.
The extension of the generated output file is taken from this template filename.
Active
You may deactivate a template expansion here.
Commentline starts with
Because the transformation writes additional comments into the generated output files, you have to define the prefix that marks a comment line (e.g. '@REM' for .bat files, '#' for .sh scripts).
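
The interplay of "maximal display-height" and "maximal display-width" described above can be illustrated with a little arithmetic. The sketch below is not TSEP's own code: fitWithin() is a hypothetical helper, and it assumes that images are only ever scaled down, never enlarged.

```php
<?php
// Hypothetical helper: returns [width, height] scaled to fit inside the
// maximal display size while keeping the aspect ratio.
// Assumption: images smaller than the limits are left unchanged.
function fitWithin(int $width, int $height, int $maxWidth, int $maxHeight): array
{
    // The smaller of the two ratios decides, so that both limits are respected.
    $scale = min($maxWidth / $width, $maxHeight / $height, 1.0);

    return [
        max(1, (int) round($width * $scale)),
        max(1, (int) round($height * $scale)),
    ];
}

// An 800x600 screenshot with both limits set to 120 pixels:
[$w, $h] = fitWithin(800, 600, 120, 120);
echo $w . 'x' . $h, "\n"; // prints "120x90"
```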

Manage ContentImages

ContentImage File Lists
In this area, all existing ContentImage File Lists and all associated transformation output files are shown.
You may open, download or delete every file.
On ContentImage File Lists you may launch a transformation.
Manually create ContentImage File List, from currently indexed Pages
In this area you may select an existing Indexing Profile and create a ContentImage File List manually.