The phpMySearch search engine system is a complete World Wide Web indexing and
searching system for a small domain or intranet. It is not meant to replace
powerful internet-wide search systems such as Lycos, Infoseek, Webcrawler and
AltaVista. Instead, it is meant to cover the search needs of a single company,
campus, or even a particular subsection of a web site.
The search engine uses PHP, MySQL, the CURL library, and an Adobe PDF-to-HTML
conversion gateway.
*nix or Windows 95/98/NT system environment
PHP 4.0.2 or higher
MySQL 3.23.32 or higher
CURL 7.0.2 or higher (usually comes with PHP, but must be enabled in php.ini)
· $DBName      Name of the MySQL database that the search engine will use
· $DBUser      MySQL username
· $DBPassword  MySQL password
· $DBHost      Host where your MySQL server resides
NOTE: All required tables will be created automatically.
Download the latest version of CURL from http://curl.haxx.se/download.html
The current version of CURL is located at http://curl.haxx.se/download/curl-7.8.tar.gz
Uncompress it:
gunzip curl-7.x.tar.gz
tar xvf curl-7.x.tar
NOTE: You have to patch the sources so that libcurl gets the correct version number.
Find the file lib/Makefile.am and make the following changes:
Line #2
# $Id: Makefile.am,v 1.22 2001/05/29 19:17:03 bagder Exp $
change to
# $Id: Makefile.am,v 1.22 2001/06/11 12:30:57 bagder Exp $
Line #19
libcurl_la_LDFLAGS = -version-info 2:0:1
change to
libcurl_la_LDFLAGS = -version-info 2:1:0
Install CURL with
./configure
make
make install
Compile CURL support into PHP
Go to the PHP sources folder and run
./configure --with-apxs --with-curl
make
make install
Make a symbolic link to libcurl.so.1 in /usr/lib
ln -s /usr/local/lib/libcurl.so.1 /usr/lib/libcurl.so.1
If this is your first installation of PHP, recompile Apache with the PHP module
cd ../apache_1.3.x
./configure --activate-module=src/modules/php4/libphp4.a
make
make install
Install PHP as a CGI/command-line binary as well, because the spider is
executed from the command line
cd ../php-x.x.x
./configure --with-curl
make
make install
Don’t worry: PHP will still run as an Apache module too.
On Windows, uncomment the following line in php.ini
;extension=php_curl.dll
Find the following libraries in your PHP distribution
php4ts.dll, SSLEAY32.dll, php_curl.dll, MSVCRT.dll, libeay32.dll
and copy them to Windows\System
To access the administrative interface, run the script admin.php
The default login and password are:
login: admin
password: admin
It is recommended that you change them.
Table 1-1: Fields and buttons on the Administration page.
Field or button | Description
DB Main table name | Name of the table used for storing and indexing documents. The default name is _Pages.
DB Settings table name | Name of the table used for storing search engine settings. The default name is settings.
DB Spider state table name | Name of the table used for storing the spider’s working state. The default name is spider.
Search start URL's | A list of URLs from which the crawler starts gathering information. To add a new URL to the list, type it in the field below and push the ADD button. To remove URLs, check the checkboxes next to the URLs you’d like to delete and push the REMOVE button.
Not indexed URL's (Black list) | A list of URLs that the crawler will ignore. To add a new URL to the list, type it in the field below and push the ADD button. To remove URLs, check the checkboxes next to the URLs you’d like to delete and push the REMOVE button.
Document extensions to index | A list of document extensions that the spider should try to index. To add a new extension to the list, type it in the field below and push the ADD button. To remove extensions, check the checkboxes next to the extensions you’d like to delete and push the REMOVE button.
Search depth | Tells the spider how many levels of links it should follow from the start pages while crawling. 0 - don't follow any links; 1 - follow links only from the first page; 2 - follow links from the first page + 1; 3 - follow links from the first page + 2; …
Reparse all | If this check box is checked, the spider clears the database and crawls everything again; otherwise it parses only pages that were updated. It is unchecked by default.
Automatic spider start | Check this box if you do not have access to the crontab tool on *nix systems or the task scheduler on Windows systems. If it is checked, each time a visitor uses the search script, the script checks whether it is time to start the crawler. If it finds that it is time, or that the crawler was not started at the specified time, it starts the spider script. It is recommended that you use system scheduling utilities to start the spider.
Start time | Time to start the spider.
Start spider each (days) | Period, in days, between spider restarts.
Force crawling | Click the Start Spider button to start the spider immediately.
Number of links per page | Number of links to display on a single page.
Max pages block | How many pages should be visible in the pages menu.
Search Engine log file name | Path to the file where the search engine logs its work.
Spider Engine log file name | Path to the file where the spider logs its work.
Admin Tool log file name | Path to the file where all changes made in the admin tool are logged.
Templates path | Path to the template files.
Admin Login | Login name of the administrator.
Admin Password | Administrator’s password.
Confirm Password | Confirmation of the administrator’s password.
Submit | Click the Submit button to save all changes.
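The Search depth values above can be read as a depth-limited, breadth-first walk over the link graph. The following Python sketch only illustrates that reading; it is not the spider's actual code. The function name crawl, the in-memory SITE link map, and the page names are assumptions made for the example, and no real pages are fetched.

```python
from collections import deque

def crawl(start_urls, links, max_depth):
    """Return pages in visit order, following links at most max_depth levels
    away from the start URLs (0 means: visit the start URLs only)."""
    seen = set()
    queue = deque()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: index the page, do not follow its links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order

# Hypothetical three-page site: a links to b, b links to c.
SITE = {"a": ["b"], "b": ["c"], "c": []}

for depth in range(3):
    print(depth, crawl(["a"], SITE, depth))
```

With depth 0 only the start page is visited; each extra level adds the pages linked from the level before it.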
Searching
To search for a single word, just type the word you’d like to search for and
press Submit.
You can also use Boolean logic to narrow the search. See the table below for
the allowed operators.
Table 2-1: Search Boolean logic.
Operator | Description
AND | Finds documents containing all of the specified words or phrases. Peanut AND butter finds documents with both the word peanut and the word butter.
OR | Finds documents containing at least one of the specified words or phrases. Peanut OR butter finds documents containing either peanut or butter. The found documents could contain both items, but not necessarily.
AND NOT | Excludes documents containing the specified word or phrase. Peanut AND NOT butter finds documents that contain peanut but not butter. NOT must be used with another operator, such as AND. The search engine does not accept 'peanut NOT butter'; instead, specify peanut AND NOT butter.
OR NOT | Finds documents that contain the first word or phrase, or that do not contain the second. Peanut OR NOT butter finds documents that contain peanut or do not contain butter.
“” | Quotation marks are used to denote exact phrases. For example, a search on "New York Times" will match only documents containing the words as an exact phrase. It will not find pages with the words used in a different order, such as "New times in York!"
{ } | Braces are used to denote folders. For example, a search on {CPAN/objects} will match only documents stored in www.servername.com/currentlocation/CPAN/objects
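The operators in Table 2-1 map naturally onto set operations over the documents that contain each word. The Python sketch below illustrates the intended semantics only, not how phpMySearch evaluates queries internally; the sample documents, the helper names containing and contains_phrase, and the word-splitting rule are all assumptions made for the example.

```python
# Hypothetical mini-corpus: document id -> text.
DOCS = {
    "d1": "peanut butter sandwich",
    "d2": "peanut brittle",
    "d3": "butter cookies",
    "d4": "plain toast",
}

def containing(word):
    """Ids of documents that contain the given word (case-insensitive)."""
    word = word.lower()
    return {doc_id for doc_id, text in DOCS.items()
            if word in text.lower().split()}

def contains_phrase(text, phrase):
    """True if the words of `phrase` occur consecutively, in order, in `text`."""
    words, target = text.lower().split(), phrase.lower().split()
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

ALL = set(DOCS)
peanut, butter = containing("Peanut"), containing("butter")

both = peanut & butter            # Peanut AND butter
either = peanut | butter          # Peanut OR butter
without = peanut - butter         # Peanut AND NOT butter
or_not = peanut | (ALL - butter)  # Peanut OR NOT butter

print(sorted(both), sorted(either), sorted(without), sorted(or_not))
```

Note that OR NOT is the broadest of the four: it also returns documents that mention neither word, because they satisfy the "not containing butter" half.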
You can also navigate through the site folder structure. In the drop-down box
before the Submit button you will see a list of the subfolders of the current
folder. By selecting a folder name you restrict the search to that folder and
its subfolders. ‘..’ allows you to go one level up.
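Restricting a search to a folder and its subfolders amounts to a path-prefix test on each indexed URL, and ‘..’ amounts to taking the parent of the current folder. The Python sketch below shows one way to express that, under the assumption that folders are compared against the URL path; the function names in_folder and parent_folder and the sample URLs are hypothetical and not part of phpMySearch.

```python
import posixpath
from urllib.parse import urlparse

def in_folder(url, folder):
    """True if the URL's path lies inside `folder` or one of its subfolders."""
    folder = folder.strip("/")
    prefix = "/" + folder + "/" if folder else "/"  # empty folder means site root
    return urlparse(url).path.startswith(prefix)

def parent_folder(folder):
    """Folder one level up, as selecting '..' would give ('' is the site root)."""
    return posixpath.dirname(folder.strip("/"))

# Hypothetical indexed URLs.
URLS = [
    "http://www.servername.com/currentlocation/CPAN/objects/a.html",
    "http://www.servername.com/currentlocation/CPAN/index.html",
    "http://www.servername.com/other/b.html",
]

current = "currentlocation/CPAN/objects"
print([u for u in URLS if in_folder(u, current)])
print([u for u in URLS if in_folder(u, parent_folder(current))])
```

The trailing slash in the prefix keeps a folder such as CPAN from matching a sibling whose name merely starts with the same letters.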