— 14 —
WebSearcher: A Simple Search Tool


The vast amount of information available on the World Wide Web is its greatest strength. Millions of resources scattered across the world are available any day, at any time, to anyone with a connection and a browser. Great, right?

Try finding something you need. Since the creation of the World Wide Web, search tools have gone from nonexistent to extremely prevalent. I can think of ten different ones off the top of my head, and I'm sure many more exist.

Search engines provide the compass for the Web traveler. You find your way either by working through a series of narrower and narrower categories or by just typing in a keyword.

The search engine then looks through its database of registered URLs and returns a listing of the pages that match the keywords entered.

Right now most search engines depend on the use of a Web browser such as Netscape Navigator: You go to the site of the search engine, enter your query, and in a moment or two, the browser displays a set of pages that match your request.

Now, using a custom control such as the Sax Webster control described in Chapter 4, "Using Custom Controls for WWW Programming," you can put this search engine on the user's desktop.

Designing the Application


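There are a few basic capabilities you'll want this search engine to have:

Accept one or more user-specified URLs as starting points

Accept a keyword to search for

Load each starting page, along with the pages it links to, and search the text for the keyword

List the URLs of the pages that contain the keyword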



For some reason the Webster control seems to have trouble communicating over certain dial-up TCP/IP stacks, including the stack included with Internet in a Box. If at all possible, use the Microsoft TCP/IP stack that ships with Windows 95. It is very easy to configure the Dial-Up Networking feature of Windows 95 to work with just about any Internet service provider. There are two Microsoft Knowledge Base articles available at the Microsoft Web site that explain how to do this. Point your browser to http://www.microsoft.com/kb/peropsys/win95/q138789.htm and http://www.microsoft.com/kb/peropsys/win95/q153038.htm to view these articles.

With these ideas in place, you can start thinking about the processes involved in setting up the form and the flow of the program.

Knowing the base functionality of the program, you can choose the controls for the program. Figure 14.1 shows how the form should look when you're finished with this section.



The code for this chapter is included on the CD-ROM accompanying the book. You may enter it as you follow along in this chapter or copy it from the CD.

CD-ROM: I've included a file named CODE-17.ZIP that contains the code for this chapter. –Craig


Figure 14.1. Design time view of the Web search application.

To create this form, follow these steps:

  1. Start a new project in Visual Basic. Add the Webster control to the toolbox using Tools | Custom Controls. You'll use this control to load the URLs and parse the links found on the loaded pages.

  2. Increase the form's size to allow enough room for all the controls.

     Note: This application was originally designed on a monitor set at 800 x 600 resolution. If you are limited to, or prefer, 640 x 480, you'll probably have to shrink some of the controls on the left-hand side of the form to get everything to fit.

  3. Add a Webster control to the form, positioning it so it takes up the right half of the form.

  4. Set the HomePage property to an empty string. Set the LoadImages property to False because the program searches only text.

  5. Set the PagesToCache property to 1 to remove any page caching. This is necessary to ensure that Webster loads and parses each page. Otherwise, a cached page would not be properly searched because Webster merely re-displays cached pages without firing the LoadComplete event, or any event for that matter.

  6. Using the control's custom properties page (select the Custom property in VB's Property Window and click the button with the ellipsis), select the Display tab and turn off all the Buttons checkboxes except for the Back/Forth checkbox. The tab should appear as in Figure 14.2.

     Note: You can leave more of the buttons enabled if you wish, but you must make sure that while the application is performing a search, the user is unable to change the URL that Webster is attempting to load. The best method for doing so is to modify the control's ButtonMask property right before a search to turn off all buttons. Then, upon completion of a search, turn your buttons back on using the same property.

     Figure 14.2. Webster's custom properties Display tab.

  7. Add a status bar (StatusBar1) and two textboxes: one for entering URLs (txtURL) and one for entering the keyword for the search (txtKey).

  8. Add three listboxes: one for the user-specified URLs to search (lbURL), another for the anchors found within those user-specified URLs (lbAnchor), and one for the URLs of pages containing the specified keyword (lbFound).

  9. Add the labels as shown in Figure 14.1.

  10. Add command buttons for adding (cmdURL, Index = 0) and deleting (cmdURL, Index = 1) user-specified URLs, starting the search (cmdSearch), resetting the application (cmdReset), and exiting (cmdExit).

With these controls in place, the form should look like Figure 14.1. If so, you can start adding code to the project. Otherwise, retrace your steps and make it look similar.

Coding the Application


Now that the form's controls are in place, it's time to add some code. This section provides all the code necessary to make the search application operate.

The Declarations Section


There are a few form-level variables defined in the Declarations section of our form's code. Open the form's code module by clicking the View Code button on the VB Project window or by pressing the F7 key. In the Declarations section, enter the following lines:

Option Explicit
Dim fPageLoaded%        'set true when a page is loaded
Dim LBFlag As Integer   'which list box to process

The first line specifies that all variables within the application must be declared before they can be used. The first variable, fPageLoaded%, is a flag that records whether a page is loaded in the Webster control. Because Webster caches pages, if the first page to be searched is the page currently loaded in the control, Webster won't re-load it. If Webster doesn't re-load the page, the LoadComplete event won't fire and the program won't process the page. The Search button's code therefore checks this flag and dismisses any loaded page before a search begins.

The second variable, LBFlag, is a flag that tracks which URL list box is being processed. The application only searches the URLs specified by the user and then any URLs linked on those pages. If you're processing URLs from the user-specified list, you'll load the anchor list box with all the links found on the pages. If you're processing the anchor list box, you'll only be searching for text, not links.
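If the bare 0 and 1 prove hard to keep straight, the two states of LBFlag could be given names in the Declarations section. This is only a sketch: the constant names are my own invention and do not appear in the listings that follow, which use the literal values.

```vb
'Hypothetical names for the two LBFlag states
'(the listings in this chapter use the literals 0 and 1)
Const LB_USER_URLS = 0   'processing lbURL: collect links and search text
Const LB_ANCHORS = 1     'processing lbAnchor: search text only
```

With these in place, an assignment such as LBFlag = LB_USER_URLS says what it means without a comment.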

The AddMatch Subroutine


This subroutine adds the URL of a page containing the search string to the lbFound listbox. The URL to be added is provided as a string parameter to the subroutine. The code, shown in Listing 14.1, first searches all the URLs currently in the listbox to verify that the URL specified doesn't already exist in the list. If a duplicate URL is not found, the URL is added to the list box.

Listing 14.1. The AddMatch subroutine.

Sub AddMatch(sMatchURL As String)
Dim x As Integer
'scan every entry already in the list for a duplicate
For x = 0 To lbFound.ListCount - 1
    If lbFound.List(x) = sMatchURL Then
        Exit Sub    'already listed, nothing to add
    End If
Next x
lbFound.AddItem sMatchURL
End Sub

Specifying the URLs to Search


Let's start off by coding the adding and removing of user-specified URLs from the lbURL listbox. Each of the URLs the user adds is searched for the keyword when the Search command button is pressed.

The Add URL and Remove URL buttons are in a control array; the code behind their shared Click event is shown in Listing 14.2.

Listing 14.2. The cmdURL_Click event code.

Private Sub cmdURL_Click(Index As Integer)
'allows user to build URL listbox
    Select Case Index
        Case 0  'add url
            If Trim(txtURL.Text) = "" Then
                Exit Sub
            End If
            lbURL.AddItem (txtURL.Text)
            txtURL.Text = ""
        Case 1  'remove url
            If lbURL.ListIndex < 0 Then
                Exit Sub
            Else
                lbURL.RemoveItem (lbURL.ListIndex)
            End If
    End Select
End Sub

The Case 0 section handles the click for the Add URL button. It verifies that the user actually typed something in the txtURL textbox, and then adds the contents of the textbox to the URL listbox. It then empties the contents of the txtURL textbox. The URL itself is not validated at this point; a URL that cannot be loaded simply ends in one of the failure states when the search runs.
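If you'd like to tidy what the user types before it reaches the listbox, a small helper along the following lines could be called from the Case 0 branch. This is a sketch, not part of the code on the CD-ROM: the function name NormalizeURL is hypothetical, and it assumes that a URL typed without a scheme is meant to be an HTTP address.

```vb
'Hypothetical helper: tidies a URL typed by the user.
'Returns "" when nothing usable was entered.
Function NormalizeURL(ByVal sURL As String) As String
    sURL = Trim$(sURL)
    If Len(sURL) = 0 Then
        NormalizeURL = ""
    ElseIf InStr(sURL, "://") = 0 Then
        'no scheme given, so assume the user meant HTTP
        NormalizeURL = "http://" & sURL
    Else
        NormalizeURL = sURL
    End If
End Function
```

The Add URL branch could then add NormalizeURL(txtURL.Text) to lbURL whenever the result is not an empty string.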

Case 1 handles the removal of any user-specified URLs. It verifies that a URL is actually highlighted and then removes it from the list.

The Reset Button


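The Reset button performs the following functions:

Clear the URL and keyword text boxes

Clear the three list boxes

Re-enable the Search button

Set the status bar to "Ready"

Cancel any page load in progress in the Webster control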

Listing 14.3 contains the code for the Click event of the Reset button.

Listing 14.3. The cmdReset_Click event code.

Private Sub cmdReset_Click()
    txtURL.Text = ""
    lbURL.Clear
    txtKey.Text = ""
    lbAnchor.Clear
    lbFound.Clear
    cmdSearch.Enabled = True
    StatusBar1.SimpleText = "Ready"
    Webster1.Cancel
End Sub

The Search Button


The main program flow is handled through two events. After the user inputs the URLs and the keyword, the next step is to click the Search button. The code for the cmdSearch_Click event is shown in Listing 14.4.

Listing 14.4. The cmdSearch_Click event.

Private Sub cmdSearch_Click()
Dim URL_Index As Integer
Dim Anchor_Index As Integer
'verify that we have at least one URL to search:
If lbURL.ListCount = 0 Then
    MsgBox "Please enter a URL into the URL Search List!"
    Exit Sub
End If
'verify that there's some text to search for:
If Len(Trim$(txtKey)) = 0 Then
    MsgBox "Please enter a search key into the Keyword text box!"
    Exit Sub
End If
'disable the search button
cmdSearch.Enabled = False
'clear any existing stuff:
lbAnchor.Clear
lbFound.Clear
If fPageLoaded% Then Webster1.DismissPage ""
'start retrieval of the user-specified URLs here
LBFlag = 0
For URL_Index = 0 To lbURL.ListCount - 1
    
    'select index here.
    lbURL.ListIndex = URL_Index
    
    StatusBar1.SimpleText = "Loading " & lbURL.Text
    Webster1.LoadPage lbURL.Text, False
    
    'wait till the page is loaded
    While Choose(Webster1.LoadStatus + 1, 0, 1, 1, 1, 1, 0, 0)
        DoEvents
    Wend
        
Next URL_Index
'the anchors from the user-specified URLs are collected; now load those
LBFlag = 1
For Anchor_Index = 0 To lbAnchor.ListCount - 1
    'select index here.
    lbAnchor.ListIndex = Anchor_Index
    StatusBar1.SimpleText = "Loading " & lbAnchor.Text
    Webster1.LoadPage lbAnchor.Text, False
    
    'wait till the page is loaded
    While Choose(Webster1.LoadStatus + 1, 0, 1, 1, 1, 1, 0, 0)
        DoEvents
    Wend
Next Anchor_Index
'turn the search button back on
cmdSearch.Enabled = True
StatusBar1.SimpleText = "Ready"
End Sub

The first lines of the routine perform a few essential tasks. First, you verify that the user has entered at least one URL to search as well as a string to search for. If either is missing, a message box is displayed and the routine exits. Next, you disable the Search button so the user can't click it again. Because the response time varies from a few seconds to a few minutes, depending on how many URLs are specified and which ones, you don't want the event fired a second time before the whole procedure has completed.

Next, the result listboxes (lbAnchor and lbFound) are cleared. If a page has previously been loaded, the Webster control is also cleared.

You then set the LBFlag (listbox flag) to zero. This is used in the Webster control's LoadComplete event to determine whether the page that was just loaded came from the user-specified list (lbURL) or from the list of anchors retrieved from the user-specified URLs (lbAnchor).

The first loop (starting with the line For URL_Index = 0 To lbURL.ListCount - 1) loops through all the URLs the user entered using the Add URL button. The ListIndex property of the lbURL listbox is set to the current loop index. The status bar text is updated. Finally the page load is started by invoking the Webster control's LoadPage method. After the load is started, it's time to sit back and wait for the Webster control to finish loading the page. This is done with the DoEvents loop that immediately follows the LoadPage method.

The While loop condition contains a seldom-used Visual Basic function: Choose(). The syntax for Choose() is

Choose(index, choice-1[, choice-2, ... [,choice-n]])

The function returns a value from the list of choices based on the value of index. If index is 1, Choose() evaluates to the first choice in the list; if index is 2, it evaluates to the second choice, and so on. Note that if index is less than 1 or greater than the number of given choices, the function will return Null.

The LoadStatus property of the Webster control has the following possible values:

Value   Meaning
0       Page load is complete
1       Connecting to host
2       Connected, waiting
3       Page text is loading
4       Images are loading
5       Load failure
6       Unknown (URL failed to load)


The application needs to wait until either the page has completely loaded (LoadStatus = 0) or some error condition has occurred (LoadStatus = 5 or LoadStatus = 6). Because Choose() counts its choices from 1 while LoadStatus starts at 0, the expression adds 1 to the status. Choose(Webster1.LoadStatus + 1, 0, 1, 1, 1, 1, 0, 0) therefore returns 1 while the status is 1 through 4, which keeps the While loop active. When the status becomes 0, 5, or 6, Choose() returns 0 and the code drops out of the loop.
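If the Choose() expression feels too terse, the same test can be spelled out with Select Case. The function below is a sketch of that alternative; the name IsStillLoading is my own, and the status values are simply the LoadStatus values listed above.

```vb
'Hypothetical helper: True while a page load is still in progress.
'The status values are the LoadStatus values listed in this chapter.
Function IsStillLoading(ByVal nStatus As Integer) As Boolean
    Select Case nStatus
        Case 1 To 4     'connecting, waiting, text loading, images loading
            IsStillLoading = True
        Case Else       '0 = complete, 5 and 6 = failure
            IsStillLoading = False
    End Select
End Function
```

The waiting loop would then read While IsStillLoading(Webster1.LoadStatus), with the same DoEvents call inside it. The trade-off is a few more lines of code in exchange for a condition that explains itself.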

After the DoEvents loop has been exited, the application loops to the next URL in the lbURL list box and performs the above operations on that URL.
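One thing the waiting loop does not guard against is a load that never reaches a finished or failed state. If that worries you, a time limit could be added along these lines. This is a sketch under assumptions: WaitExpired and WaitForPage are hypothetical names, the limit you pass is arbitrary, and I'm assuming that the Cancel method used by the Reset button is an acceptable way to abandon a stalled load.

```vb
'Hypothetical helper: True once sngLimit seconds have passed since sngStart.
'sngStart is a value previously read from the Timer function.
Function WaitExpired(ByVal sngStart As Single, ByVal sngLimit As Single) As Boolean
    Dim sngElapsed As Single
    sngElapsed = Timer - sngStart
    'Timer restarts from zero at midnight, so allow for the wrap
    If sngElapsed < 0 Then sngElapsed = sngElapsed + 86400
    WaitExpired = (sngElapsed >= sngLimit)
End Function

'Hypothetical replacement for the waiting loop: gives up after sngLimit seconds
Private Sub WaitForPage(ByVal sngLimit As Single)
    Dim sngStart As Single
    sngStart = Timer
    Do While Choose(Webster1.LoadStatus + 1, 0, 1, 1, 1, 1, 0, 0)
        DoEvents
        If WaitExpired(sngStart, sngLimit) Then
            Webster1.Cancel     'abandon the stalled load (assumed safe here)
            Exit Do
        End If
    Loop
End Sub
```

A Do loop is used instead of While...Wend because it can be left early with Exit Do.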

Part of the processing that's done while the above loop is loading pages is to fill the lbAnchor listbox with all the links found on the pages specified in lbURL. The next section of code in Listing 14.4 loads the URLs from the lbAnchor list box.

The first step is to set the LBFlag (list box flag) to 1. This informs the LoadComplete event that it no longer needs to load the lbAnchor listbox with the links contained in the pages that get loaded.

Next, the application loops through all the URLs in the lbAnchor list box. The code within the loop is identical to the code within the lbURL loop described above.
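Because the two loops are identical apart from the listbox they read, the shared steps could be pulled into helper routines. The following is a sketch of that refactoring, not the code on the CD-ROM; LoadAndWait and LoadList are hypothetical names, and the Webster calls are the same ones used in Listing 14.4.

```vb
'Hypothetical helper: loads one URL and waits until Webster finishes or fails
Private Sub LoadAndWait(ByVal sURL As String)
    StatusBar1.SimpleText = "Loading " & sURL
    Webster1.LoadPage sURL, False
    'wait till the page is loaded
    While Choose(Webster1.LoadStatus + 1, 0, 1, 1, 1, 1, 0, 0)
        DoEvents
    Wend
End Sub

'Hypothetical helper: loads every URL in the given list box in turn
Private Sub LoadList(lb As ListBox)
    Dim i As Integer
    For i = 0 To lb.ListCount - 1
        LoadAndWait lb.List(i)
    Next i
End Sub
```

The body of the search would then shrink to setting LBFlag to 0, calling LoadList lbURL, setting LBFlag to 1, and calling LoadList lbAnchor. One difference to be aware of: this version reads each entry through the List property and so does not highlight the URL currently being loaded, as setting ListIndex does.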

After the lbAnchor loop has completed, the Search button is turned back on and the status bar displays the Ready message. The application is now ready to begin a new search!

Parsing Loaded Pages


As pages are loaded by the loops described in the preceding section, the LoadComplete event is fired for the Webster control. This section describes the code of the LoadComplete event and demonstrates one of the more powerful features of the Webster control. The code for this event is given in Listing 14.5.

Listing 14.5. The Webster1_LoadComplete event.

Private Sub Webster1_LoadComplete(URL As String, ByVal Status As Integer)
    Dim i%, PageText$, lSize%, LinkURL$
    
    fPageLoaded% = True
    
    'if search button is on, we're not searching
    ' so don't process the page
    If cmdSearch.Enabled Then Exit Sub
    
    'check to see whether we're loading
    ' a top level page or a subordinate
    If LBFlag = 0 Then      'top-level
        'fill the anchor list box with all
        ' the HTTP URLs on this page
        '(a local variable is used so the URL parameter is left untouched)
        For i% = 0 To Webster1.GetLinkCount("") - 1
            LinkURL$ = Webster1.GetLinkURL("", i%)
            If UCase$(Left$(LinkURL$, 4)) = "HTTP" Then lbAnchor.AddItem LinkURL$
        Next
    End If
    
    'Get the text on the page
    lSize% = Webster1.GetTextSize("")
    PageText$ = Webster1.GetText("", 0, lSize%)
    'If the text contains the search string, add the URL;
    ' if it doesn't, check the page title as well
    If InStr(UCase$(PageText$), UCase$(txtKey)) Then
        AddMatch (Webster1.PageURL)
    ElseIf InStr(UCase$(Webster1.PageTitle), UCase$(txtKey)) Then
        AddMatch (Webster1.PageURL)
    End If
    
End Sub

The event provides two parameters, URL and Status, both of which are ignored in this application. The code within the event will work regardless of the Status indicated.

The first line of code sets the fPageLoaded% flag to True to inform the rest of the application that at least one page has been loaded in the Webster control.

The next line of code checks to see whether the Search button is enabled. If the button is enabled, a search is not currently in progress and the code has no reason to parse the page's contents, so the routine exits.

The code next checks the value of the listbox flag, LBFlag, to determine whether the links on the page just loaded should be placed into the anchors listbox (lbAnchor). If the value of this flag is 0, the links are placed into the lbAnchor list box. Otherwise, the code continues.

When the user-specified pages are being loaded (LBFlag = 0), any HTTP links found on those pages are placed into the lbAnchor listbox. The Webster control has a method named GetLinkCount that returns the number of links found on a specified page. Using an empty string as the parameter, as is done in Listing 14.5, returns the number of links for the currently loaded page. The control also provides an array of the URLs for these links. This array is accessed using the GetLinkURL method, specifying the parent URL (in this case, an empty string to indicate the currently loaded page) and the index to retrieve from the array.

The For...Next loop iterates through all of the links on the page that was just loaded. Only the HTTP links are added to the lbAnchor list box, though, because you'll be using the Webster control to load each of the URLs that gets added to that list box.
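Pages often link to the same URL more than once, and two starting pages may link to the same place, so lbAnchor can end up with duplicates that are then loaded repeatedly. A check like the one AddMatch performs for lbFound could be applied here too. The function below is a sketch; ListContains is a hypothetical name, and it uses an exact string comparison, so two URLs that differ only in letter case are treated as different.

```vb
'Hypothetical helper: True if sItem is already an entry in the list box.
'Uses an exact comparison of the two strings.
Function ListContains(lb As ListBox, ByVal sItem As String) As Boolean
    Dim i As Integer
    ListContains = False
    For i = 0 To lb.ListCount - 1
        If lb.List(i) = sItem Then
            ListContains = True
            Exit Function
        End If
    Next i
End Function
```

In the LoadComplete event, the link returned by GetLinkURL would be added to lbAnchor only when ListContains(lbAnchor, ...) returns False for it.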

After the links are loaded, the procedure moves on to search the loaded page for the string the user entered into the txtKey text box. The Webster control provides two methods that make this possible. First, determine the size of the text using the GetTextSize method. Again, providing an empty string as the parameter indicates that you're interested in the text size for the current page. The method returns the size of the pure text contained on the page. Characters contained within HTML tags or occurring outside the <BODY> tags are not considered pure text, and you're not interested in searching them anyway. Once the size is determined, the GetText method is used to retrieve all the pure text from the current page.

The code then uses the InStr() function to determine whether the search string is contained within the text. Both strings are converted to uppercase first, so the comparison ignores case. If the search string is found, AddMatch is called to add the current URL to the found listbox (lbFound). If the string is not found within the text, the code checks for the string in the page's title. Again, if the search string is found, the URL is added to lbFound using AddMatch.
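The matching test itself can be isolated in a small function, which makes it easy to try out on its own and to reuse for both the page text and the title. This is a sketch rather than the code in Listing 14.5, and the name ContainsKey is hypothetical.

```vb
'Hypothetical helper: True if sKey occurs anywhere in sText, ignoring case.
'An empty key never matches; InStr on its own would report a match
'for an empty search string in any non-empty text.
Function ContainsKey(ByVal sText As String, ByVal sKey As String) As Boolean
    If Len(sKey) = 0 Then
        ContainsKey = False
    Else
        ContainsKey = (InStr(UCase$(sText), UCase$(sKey)) > 0)
    End If
End Function
```

For example, ContainsKey("A Cool Site", "cool") is True, while ContainsKey("A Cool Site", "hot") is False. The event code would call it once with the page text and, if that fails, once more with the page title.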

And that, finally, concludes the majority of our search engine.

Viewing Pages


Another feature of the Web Search Tool is that it allows you to load any of the URLs from any of the listboxes into the Webster browser. If a search is not in progress, you can double-click a URL in any of the listboxes and it will be loaded by the Webster control. The code to make this happen is contained in Listing 14.6 and needs little explanation.

Also, because the Webster control allows the user to click hypertext links and load the pages they point to, you'll want to disable this feature while a search is in progress. The best way to do this is with the Webster control's DoClickURL event. By setting the Cancel parameter to True within the event's code whenever the Search button is disabled, you prevent Webster from loading the page pointed to by the URL that was clicked. Although not applicable to this application, this event can also be used to trap URLs that you want to prevent the user from accessing.

Listing 14.6. Code to load pages from the list boxes.

Private Sub lbURL_DblClick()
    If cmdSearch.Enabled Then Webster1.LoadPage lbURL.Text, False
End Sub
Private Sub lbAnchor_DblClick()
    If cmdSearch.Enabled Then Webster1.LoadPage lbAnchor.Text, False    
End Sub
Private Sub lbFound_DblClick()
    If cmdSearch.Enabled Then Webster1.LoadPage lbFound.Text, False
End Sub
Private Sub Webster1_DoClickURL(SelectedURL As String, Cancel As Boolean)
    'if the search button is off, don't allow clicks
    '  (the program is still searching)
    If Not (cmdSearch.Enabled) Then Cancel = True
    
End Sub

Testing the Application


This application is simple to test. After all the code is entered or copied from the CD-ROM, run the application. Make sure you have either an active Internet connection or a Web server running locally.



If you're running Windows 95 and would like to run a local Web server, I'd recommend O'Reilly and Associates WebSite server. An evaluation copy is included on the CD-ROM accompanying this book. Another good choice is the FrontPage Personal Web Server that ships with Microsoft's FrontPage Web site editor.

Enter a URL into the URL To Add text box and click the Add URL button. Next, enter a string to search for on the page specified by the URL.

Click the Search button and watch the action. You should see the page you specified load into the Webster browser. Then, if there are any links on that page, the Anchor List Box is filled with them and each page is loaded and searched. The URLs for any pages with matches are added to the Matched URLs list box.

Once the search is completed (the Search button turns back on), you can double-click any of the URLs to load the page into the Webster control. Then, use the Webster browser just like any other Web browser to surf to your heart's content.

For example, Figure 14.3 shows the results of searching the URL http://www.infi.net for the string cool. After the search filled the listboxes, I double-clicked the URL in the URL Search List box to re-load the starting page into the Webster control.

Figure 14.3. The Web Search Tool in action.

Other Directions


This sample application is not meant to be a fire-and-forget solution for searching the Web. Quite a few areas could be pursued, and I leave a few suggestions—ideas you can add.


Summary


From here you go on to a client-side application (Chapter 15, "LinkChecker: A Spider that Checks for Broken Links") that verifies all of the local links within a page. The Web Search and Link Verifier chapters are very similar and share quite a bit of code. With a knowledge of how the processes work, you can extend the ideas presented in this and the next chapter to your own applications.
