NGWS SDK Documentation  

This is preliminary documentation and subject to change.
To comment on this topic, please send us email at ngwssdk@microsoft.com. Thanks!

SDL Text Pattern Matching Syntax

HTML screen-scraping instructions are expressed declaratively within an SDL contract using a <text> child tag nested within the <response> tag of a service description. The <text> tag can in turn contain multiple <match> tags, each pairing a programmatic name with a regular expression that identifies a specific value to pull out of the response.

<text> Formal Syntax Definition

The <text> tag supports a single attribute, contentType, that enables filtering based on the MIME type of the returned content. The <text> tag can in turn contain any number of nested <match> tags, each of which represents a separate screen-scraped value.

<text contentType='mime type'>  
  <match pattern='regex' name='name' maxOccurs='integer or *' group='regex group' />
</text>

The <match> tag in turn supports four attributes:

name Optional. Variable name of the resulting value obtained from the corresponding regular expression.
pattern Regular expression to run in order to obtain the value. Any valid Perl 5 regular expression can be used.
group Optional.
maxOccurs Optional. Indicates the number of values that should be returned from the regular expression, given as an integer or *. The default is 1, which returns only the first match found.
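
The effect of maxOccurs can be pictured with a small sketch. This is not the SDK's implementation: it uses System.Text.RegularExpressions as a stand-in for the Perl 5 engine, and the class name MaxOccursSketch and method Extract are hypothetical. It also assumes that '*' means "all matches" and that the returned value is the first capture group (or the whole match when the pattern has no group); the syntax above does not spell out either point.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the SDK.
public static class MaxOccursSketch {
    // Returns the captured value of up to maxOccurs matches, in document order.
    // Assumption: "*" means "no limit"; any other value is parsed as an integer.
    public static List<string> Extract(string text, string pattern, string maxOccurs = "1") {
        int limit = maxOccurs == "*" ? int.MaxValue : int.Parse(maxOccurs);
        var values = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern)) {
            if (values.Count >= limit) break;
            // Assumption: prefer capture group 1, fall back to the whole match.
            values.Add(m.Groups.Count > 1 ? m.Groups[1].Value : m.Value);
        }
        return values;
    }
}
```

Under these assumptions, calling Extract on "<LI>a</LI><LI>b</LI><LI>c</LI>" with the pattern "<LI>(.*?)</LI>" yields one value by default, two values with maxOccurs "2", and all three with "*".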

Text Pattern Matching Example

The following sample SDL file extracts the last price and the change for a stock quote from the MoneyCentral website.

<?xml version="1.0"?>
<serviceDescription xmlns:s0="http://tempuri.org/main.xsd" xmlns:s1="" name="Investor" targetNamespace="" xmlns="urn:schemas-xmlsoap-org:sdl.2000-01-25">
  <httpget xmlns="urn:schemas-xmlsoap-org:get-sdl-2000-01-25">
    <service>
      <requestResponse name="GetQuote" href="http://moneycentral.msn.com/scripts/webquote.dll">
        <request>
         <param name='Symbol'/>
        </request>
        <response>
         <text>
          <match name='Last' pattern='Last&lt;/TD&gt;.*?;(.*?)&lt;/B&gt;'/>
          <match name='Change' pattern='Change&lt;/TD&gt;.*?;(.*?)&lt;/B&gt;'/>
         </text>
        </response>
      </requestResponse>
    </service>
  </httpget>
</serviceDescription>
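
Because the patterns sit inside XML attributes, the angle brackets are written as &lt; and &gt;; once unescaped, the regular expression reads Last</TD>.*?;(.*?)</B>. The sketch below shows what such a pattern would capture. The class QuotePatternSketch is hypothetical, the sample HTML in the note that follows is invented for illustration (it is not the actual MoneyCentral markup), and WebUtility.HtmlDecode stands in for the unescaping an XML parser would perform.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the SDK.
public static class QuotePatternSketch {
    // The 'Last' pattern exactly as it appears in the SDL attribute (XML-escaped).
    public const string EscapedLastPattern = "Last&lt;/TD&gt;.*?;(.*?)&lt;/B&gt;";

    // Returns the text captured by group 1, or null when the pattern does not match.
    public static string ExtractLast(string html) {
        // Unescape &lt; and &gt; to obtain: Last</TD>.*?;(.*?)</B>
        string pattern = WebUtility.HtmlDecode(EscapedLastPattern);
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For an invented table row such as <TR><TD>Last</TD><TD ALIGN=RIGHT><B>&nbsp;71.50</B></TD></TR>, the lazy .*?; skips ahead to the semicolon that ends &nbsp;, and the group captures everything up to </B>, giving 71.50.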

The SDL file above would then generate the following proxy class code.

namespace Services {
    using System.Xml.Serialization;
    using System;
    using System.Web.Services.Protocols;
  
    public class Investor : HttpClientProtocol {
        public Investor() {
            this.Path = "http://moneycentral.msn.com/scripts/webquote.dll";
        }
        [HttpMethod(typeof(TextReturnReader), typeof(UrlParameterWriter))]
        public Matches GetQuote(string symbol) {
            return (Matches)Invoke("GetQuote", this.Path + "", new object[] {symbol});
        }
        public IAsyncResult BeginGetQuote(string symbol, AsyncCallback callback, object asyncState) {
            return BeginInvoke("GetQuote", 
                               this.Path + "", 
                               new object[] {symbol}, 
                               callback, 
                               asyncState);
        }
        public Matches EndGetQuote(IAsyncResult asyncResult) {
            return (Matches)EndInvoke(asyncResult);
        }
    }
  
    public class Matches {
        [Match("Last</TD>.*?;(.*?)</B>")]
        public string Last;
        [Match("Change</TD>.*?;(.*?)</B>")]
        public string Change;
    }
}

The generated proxy could then be called with the following code:

using Services;
using System;

public class Scrape {
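    public static void Main(string[] args) {
        // Expect the stock symbol as the first command-line argument.
        if (args.Length < 1) {
            Console.WriteLine("Usage: Scrape <symbol>");
            return;
        }
        Investor investor = new Investor();
        // Request the quote page; the result carries the values named in the SDL.
        Matches matches = investor.GetQuote(args[0]);
        Console.WriteLine(matches.Last);
        Console.WriteLine(matches.Change);
    }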
}
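
The [Match] attributes on the generated Matches class suggest that the response text is mapped onto fields by running each field's pattern. How the SDK's TextReturnReader actually does this is not described here, so the following is only a guess at the general idea, written with stand-in types: SketchMatchAttribute, SketchMatches and SketchReader are all hypothetical names, and taking capture group 1 of the first match is an assumption.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the SDK's [Match] attribute; not the real type.
[AttributeUsage(AttributeTargets.Field)]
public class SketchMatchAttribute : Attribute {
    public string Pattern { get; }
    public SketchMatchAttribute(string pattern) { Pattern = pattern; }
}

// Mirrors the shape of the generated Matches class.
public class SketchMatches {
    [SketchMatch("Last</TD>.*?;(.*?)</B>")]
    public string Last;
    [SketchMatch("Change</TD>.*?;(.*?)</B>")]
    public string Change;
}

public static class SketchReader {
    // Creates a T and fills each attributed string field with capture
    // group 1 of the first match; fields whose pattern fails stay null.
    public static T Read<T>(string responseText) where T : class, new() {
        T result = new T();
        foreach (FieldInfo field in typeof(T).GetFields()) {
            var attr = (SketchMatchAttribute)Attribute.GetCustomAttribute(
                field, typeof(SketchMatchAttribute));
            if (attr == null || field.FieldType != typeof(string)) continue;
            Match m = Regex.Match(responseText, attr.Pattern);
            if (m.Success) field.SetValue(result, m.Groups[1].Value);
        }
        return result;
    }
}
```

Calling SketchReader.Read<SketchMatches>(html) on a response body would return an object whose Last and Change fields hold the captured text, much as the Scrape program reads matches.Last and matches.Change.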