HTML screen scraping instructions are expressed declaratively within an SDL contract using a <text> child tag nested within the <response> tag of a service description. The <text> tag can in turn contain multiple name/pattern pairs, each of which gives a regular expression that identifies a value to pull out of the response and the programmatic name under which that value is returned.
The <text> tag supports a single attribute, contentType, that enables filtering based on the MIME type of the returned content. The <text> tag can contain any number of nested <match> tags, each of which represents a separate screen-scraped value.
    <text contentType='mime type'>
        <match pattern='regex' name='name' maxOccurs='integer or *' group='regex group' />
    </text>
The <match> tag in turn supports four attributes:
name | Optional. Variable name of the resulting value obtained from the corresponding regular expression. |
pattern | Regular expression to run in order to obtain the value. Any valid Perl 5 regular expression can be used. |
group | Optional. |
maxOccurs | Optional. Number of values that should be returned from the regular expression, given as an integer or * (default is 1, the first match found). |
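To make the attributes more concrete, the following is a minimal C# sketch of how a single <match> entry might be evaluated with the .NET Regex class. It is not the SDL runtime's implementation. The helper name Extract and the sample input are hypothetical, and two interpretations are assumptions made for illustration: that group is a capture-group index defaulting to 1, and that maxOccurs caps the number of matches returned (with -1 standing in for '*').

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchSketch
{
    // Hypothetical helper, not part of any SDL API.
    // Assumptions: 'group' is a capture-group index, and a negative
    // maxOccurs stands in for maxOccurs='*' (no limit).
    public static List<string> Extract(string text, string pattern,
                                       int group = 1, int maxOccurs = 1)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            // Stop once the requested number of values has been collected.
            if (maxOccurs >= 0 && values.Count >= maxOccurs) break;
            values.Add(m.Groups[group].Value);
        }
        return values;
    }

    public static void Main()
    {
        // Made-up input, not taken from any real page.
        string html = "<LI>alpha</LI><LI>beta</LI><LI>gamma</LI>";
        string pattern = "<LI>(.*?)</LI>";

        // Default maxOccurs of 1: only the first match is returned.
        Console.WriteLine(string.Join(",", Extract(html, pattern)));        // alpha

        // maxOccurs='*' (modelled here as -1): every match is returned.
        Console.WriteLine(string.Join(",", Extract(html, pattern, 1, -1))); // alpha,beta,gamma
    }
}
```

The pattern is lazy ((.*?) rather than (.*)), so each match stops at the nearest closing tag instead of swallowing the rest of the page.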
The following is a sample SDL file that grabs the last price and the change for a stock quote from MoneyCentral’s website.
    <?xml version="1.0"?>
    <serviceDescription xmlns:s0="http://tempuri.org/main.xsd" xmlns:s1=""
                        name="Investor" targetNamespace=""
                        xmlns="urn:schemas-xmlsoap-org:sdl.2000-01-25">
      <httpget xmlns="urn:schemas-xmlsoap-org:get-sdl-2000-01-25">
        <service>
          <requestResponse name="GetQuote" href="http://moneycentral.msn.com/scripts/webquote.dll">
            <request>
              <param name='Symbol'/>
            </request>
            <response>
              <!-- '<' inside the patterns is written as &lt; so the attribute values are well-formed XML -->
              <text>
                <match name='Last' pattern='Last&lt;/TD>.*?;(.*?)&lt;/B>'/>
                <match name='Change' pattern='Change&lt;/TD>.*?;(.*?)&lt;/B>'/>
              </text>
            </response>
          </requestResponse>
        </service>
      </httpget>
    </serviceDescription>
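To see what the two patterns in this SDL capture, the sketch below runs them against a made-up HTML fragment using the .NET Regex class. The markup is an assumption for illustration only, since the actual layout of the quote page is not reproduced in this document, and the class name PatternDemo is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    public static void Main()
    {
        // Hypothetical fragment; the real page markup may differ.
        string html =
            "<TR><TD>Last</TD><TD><B>&nbsp;71.50</B></TD></TR>" +
            "<TR><TD>Change</TD><TD><B>&nbsp;+1.25</B></TD></TR>";

        // "Last</TD>" anchors on the label cell, ".*?;" lazily skips ahead to
        // the first ';' (here the end of "&nbsp;"), and "(.*?)</B>" captures
        // everything up to the closing bold tag.
        Match last = Regex.Match(html, "Last</TD>.*?;(.*?)</B>");
        Match change = Regex.Match(html, "Change</TD>.*?;(.*?)</B>");

        Console.WriteLine(last.Groups[1].Value);   // 71.50
        Console.WriteLine(change.Groups[1].Value); // +1.25
    }
}
```

Because '.' does not match a newline by default in this regex flavor, a label and its value that sit on different lines of the page would not be matched by these patterns as written.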
From this SDL, the following proxy class code would be generated.
    namespace Services {
        using System.Xml.Serialization;
        using System;
        using System.Web.Services.Protocols;

        public class Investor : HttpClientProtocol {

            public Investor() {
                this.Path = "http://moneycentral.msn.com/scripts/webquote.dll";
            }

            [HttpMethod(typeof(TextReturnReader), typeof(UrlParameterWriter))]
            public Matches GetQuote(string symbol) {
                return (Matches)Invoke("GetQuote", this.Path + "", new object[] {symbol});
            }

            public IAsyncResult BeginGetQuote(string symbol, AsyncCallback callback, object asyncState) {
                return BeginInvoke("GetQuote", this.Path + "", new object[] {symbol}, callback, asyncState);
            }

            public Matches EndGetQuote(IAsyncResult asyncResult) {
                return (Matches)EndInvoke(asyncResult);
            }
        }

        public class Matches {
            [Match("Last</TD>.*?;(.*?)</B>")]
            public string Last;

            [Match("Change</TD>.*?;(.*?)</B>")]
            public string Change;
        }
    }
The generated proxy could then be called with the following code:
    using Services;
    using System;

    public class Scrape {
        public static void Main(string[] args) {
            Investor investor = new Investor();
            Matches matches = investor.GetQuote(args[0]);
            Console.WriteLine(matches.Last);
            Console.WriteLine(matches.Change);
        }
    }
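As a rough illustration of how pattern attributes on a result class could drive the extraction, the sketch below reads an attribute from each field by reflection and fills the field from the response text. It is a stand-alone approximation under stated assumptions, not the behavior of TextReturnReader: SketchMatchAttribute, Quote, and ReaderSketch are hypothetical names introduced here, capture group 1 is assumed to hold the value, and the response text is made up.

```csharp
using System;
using System.Reflection;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the [Match] attribute on the generated class.
[AttributeUsage(AttributeTargets.Field)]
public class SketchMatchAttribute : Attribute
{
    public string Pattern { get; }
    public SketchMatchAttribute(string pattern) { Pattern = pattern; }
}

// Mirrors the shape of the generated Matches class.
public class Quote
{
    [SketchMatch("Last</TD>.*?;(.*?)</B>")]
    public string Last;

    [SketchMatch("Change</TD>.*?;(.*?)</B>")]
    public string Change;
}

public static class ReaderSketch
{
    // Fills each attributed public string field with capture group 1
    // of its pattern (assumption: group 1 holds the scraped value).
    public static T Read<T>(string responseText) where T : class, new()
    {
        T result = new T();
        foreach (FieldInfo field in typeof(T).GetFields())
        {
            var attr = (SketchMatchAttribute)Attribute.GetCustomAttribute(
                field, typeof(SketchMatchAttribute));
            if (attr == null || field.FieldType != typeof(string)) continue;

            Match m = Regex.Match(responseText, attr.Pattern);
            if (m.Success) field.SetValue(result, m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical response text; the real page markup may differ.
        string html =
            "<TD>Last</TD><TD><B>&nbsp;71.50</B></TD>" +
            "<TD>Change</TD><TD><B>&nbsp;+1.25</B></TD>";

        Quote q = Read<Quote>(html);
        Console.WriteLine(q.Last);   // 71.50
        Console.WriteLine(q.Change); // +1.25
    }
}
```

A field whose pattern does not match is left at its default value (null) in this sketch, so callers would need to check for that before using the result.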