Extract Data Filter

Overview

How It Works

How To Build It

The Extract Data Editor

Mark & Process Strategy

Overview

This chapter explains the Extract Data filter type. An Extract Data filter extracts text that matches one or more patterns from the surrounding text in a document. We refer to this as multi-pattern data extraction.

This filter type can be used to perform very complex extractions. For example, TextSpresso ships with an Extract Data filter for extracting all of the URLs from a document. You can extract complex patterns of text and, using a mark and process strategy, extract blocks of text containing specific patterns.

How It Works

An Extract Data filter is a special type of MultiFilter. It tells each filter in its Filter List to find all matches in the text. It then deletes everything that did not match a filter in its list, so that you are left with only the text you wanted to preserve.
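The behavior described above can be sketched outside TextSpresso. The following Python is only an illustration, assuming regular expressions stand in for the filters in the Filter List; the function name `extract` is hypothetical and is not part of TextSpresso.

```python
import re

def extract(text, patterns):
    """Keep only the text matched by any pattern; delete everything else.

    Assumption: each regular expression plays the role of one filter
    in the Filter List. This is not TextSpresso's own engine.
    """
    # Ask every "filter" for all of its matches, as (start, end) spans.
    spans = sorted(m.span() for p in patterns for m in re.finditer(p, text))

    kept = []
    last_end = 0
    for start, end in spans:
        start = max(start, last_end)  # do not copy overlapping text twice
        if start < end:
            kept.append(text[start:end])
            last_end = end
    # Anything not covered by a span has been dropped.
    return "".join(kept)

print(extract("a cat and a dog, cat", ["cat", "dog"]))  # prints catdogcat
```

Note that the result runs together, because nothing separates the preserved matches.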

Five types of filters may be added to the Filter List and used to extract data. They are: Replace Text, Replace Pattern, Pattern Insert, Word Table, and MultiFilter. Only the 'find' portion of the above filters is used during an extraction. Matching text is preserved, but not replaced or modified.

Note: the editor does not prevent you from adding filters of other types, or MultiFilters which contain filters of other types. However, filters that do not search are ignored during extraction.

Note: an Extract Data filter can contain another Extract Data filter. During extraction, however, the contained filter behaves just like a contained MultiFilter: it doesn't insert its before/after text, it only finds and returns matches.

How To Build It

By itself an Extract Data filter does nothing. You must add filters to its Filter List which define the patterns to be matched and preserved. For example, say you wanted to create a filter to extract all instances of the word "cat" from a document. First you would create an Extract Data filter. Then, in this new filter's Filter List, you would create a new Replace Text filter and type the word "cat" in the Find field of the new Replace Text filter. (This is a simple but useless example. To extract useful data you will normally use Replace Pattern filters and/or Replace Text filters which contain wildcards.)

The Extract Data Editor

When you open an Extract Data editor and switch to the Filter tab, you will see a Filter List; buttons to add, edit, and delete filters in the Filter List; a check box; and two text fields.

The Filter List and editing buttons operate exactly the same as they do in The MultiFilter Editor. You can learn how to edit the Filter List by reading the section The MultiFilter Editor in the MultiFilter chapter.

The check box is titled Extract Sequentially? This check box controls how the filter proceeds to extract data. If Extract Sequentially? is unchecked (default) then all matches returned by all filters are merged before the unmatched text is deleted. In other words, text matching any of the filters is preserved. If Extract Sequentially? is checked, then the Extract Data filter processes the text one filter at a time. It gets the matches from the first filter, deletes the unmatched text, then passes the resulting data to the second filter, and so on.
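The difference between the two modes can be seen in a small sketch. This Python is an illustration only, assuming regular expressions in place of TextSpresso filters; the function names are hypothetical.

```python
import re

def merged_spans(text, patterns):
    """Return the sorted, merged (start, end) spans matched by any pattern."""
    spans = sorted(m.span() for p in patterns for m in re.finditer(p, text))
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching spans become one span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def extract_merged(text, patterns):
    """Unchecked mode: text matching ANY of the filters is preserved."""
    return "".join(text[s:e] for s, e in merged_spans(text, patterns))

def extract_sequential(text, patterns):
    """Checked mode: each filter only sees what the previous one kept."""
    for p in patterns:
        text = "".join(text[s:e] for s, e in merged_spans(text, [p]))
    return text

source = "id=42 name=bob id=7"
filters = [r"id=\d+", r"\d+"]
print(extract_merged(source, filters))      # prints id=42id=7
print(extract_sequential(source, filters))  # prints 427
```

In the merged case the second pattern adds nothing new, because its matches already lie inside the first pattern's matches. In the sequential case the second pattern narrows the first pattern's output down to the digits.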

Extracted text can 'run together' into a garbled mess if it's not already marked off within the source text. For this reason you may also add characters to be inserted around each match in the Insert Before and Insert After fields. Use these fields to add 'markers' around each match so that the extracted data does not run together. You'll usually want to add a carriage return or tab before or after each match to separate data.
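As a rough illustration of what the Insert Before and Insert After fields do, the Python sketch below wraps each match in marker text. It assumes a regular expression in place of a TextSpresso filter; `extract_with_markers` and its parameters are hypothetical names.

```python
import re

def extract_with_markers(text, pattern, before="", after="\n"):
    """Keep only the matches, wrapping each in before/after text.

    Assumption: `before` and `after` mimic the Insert Before and
    Insert After fields; the default adds a line break after each match.
    """
    return "".join(before + m.group(0) + after
                   for m in re.finditer(pattern, text))

source = "see http://a.com and http://b.org now"
print(extract_with_markers(source, r"http://\S+"))
```

With an empty `after`, the two addresses would be joined into a single unreadable string, which is the 'run together' problem described above.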

Mark & Process Strategy

What happens when you want to extract, say, lines of text which contain a specific word or pattern? It can be difficult or impossible to specify a pattern which matches a large block of text containing some identifying text, with virtually any text before or after it. To see why, take a look at this pattern:

( "CR" TRUE 1 1 ) - find the beginning of a line.
( "c" FALSE 0 0 ) - match text which isn't the start of our identifier "cat".
( "c" TRUE 1 1 ) - match c.
( "a" TRUE 1 1 ) - match a.
( "t" TRUE 1 1 ) - Uh oh! What if we've found cab? The pattern will fail even if cat occurs later in the line.
( "CR" FALSE 0 0 )
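The same failure can be reproduced with a regular expression. This is only a rough analog, assuming that each pattern item corresponds to a character class and that a count of 0 0 means "any number"; TextSpresso's pattern engine may behave differently in its details.

```python
import re

# Assumed regex analog of the pattern above:
#   CR, any run of non-"c" characters, then "c", "a", "t",
#   then any run of non-CR characters.
naive = re.compile(r"\r[^c]*cat[^\r]*")

# Works when "cat" is the first word on the line that starts with "c".
print(naive.search("\rthe cat sat") is not None)  # prints True

# Fails when "cab" comes first, although "cat" occurs later in the line.
print(naive.search("\rcab and cat") is not None)  # prints False
```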

For such tasks it's best to use a strategy we call mark and process.

In this strategy you create a MultiFilter containing three or more filters. The first filter is a Pattern Insert filter (or a Replace Text filter if your identifier is fixed length) which looks for your identifying text and marks it off by adding NUL before or after the text. (NUL is a good marker because it's never used in text.) The second filter is an Extract Data filter which extracts all of the blocks containing NUL. This type of pattern is far easier to design and less error-prone. The last filter is Strip NUL, which removes the markers so that they aren't left behind to confuse other filters.

For example, say you want to write a filter which extracts all lines containing the word cat. First you would create a standard MultiFilter (not an Extract Data filter). The first filter in this MultiFilter would be a Pattern Insert filter which looks for "cat" and then inserts a NUL after the word. Now you can easily grab lines which contain NUL. The second filter in the MultiFilter would be the Extract Data filter "Extract Lines With NUL". We've created this filter for you, but it's important to know how it works so that you can extract other blocks with NUL markers.

Extract Lines With NUL is an Extract Data filter which contains Strip Lines With NUL. (Remember that filters in an Extract Data filter's Filter List are only used to search; their replace operations are ignored.) The Strip Lines With NUL pattern is as follows:

( "CR" TRUE 1 1 ) - Find the beginning of a line.
( "CRNUL" FALSE 0 0 ) - Match any characters but NUL or the end of the line.
( "NUL" TRUE 1 1 ) - There has to be a NUL before the end of the line, or this line isn't marked.
( "CR" FALSE 0 0 ) - Match any characters but the end of the line.

This pattern will match any line with a NUL character, regardless of the text before or after the NUL. The pattern is easy to write and easy to verify.
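For readers more familiar with regular expressions, the pattern can be approximated as shown below. This is a sketch under assumptions: each pattern item is read as a character class, a count of 0 0 is read as "any number", and CR is taken to be the carriage return character that ends each line.

```python
import re

# Assumed regex analog of the Strip Lines With NUL pattern:
#   \r            the CR that begins a line
#   [^\r\x00]*    any characters but NUL or the end of the line
#   \x00          the NUL marker
#   [^\r]*        any characters but the end of the line
marked_line = re.compile(r"\r[^\r\x00]*\x00[^\r]*")

text = "dog\rcat\x00 nap\rbird\rcat\x00"
print(marked_line.findall(text))
```

Only the two lines carrying a NUL are returned. Like the original pattern, this analog starts at a CR, so a marked first line with no CR in front of it would not be matched.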

Since you've marked off all lines containing cat with NUL, extracting those lines is now very simple.
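The whole mark and process sequence for the cat example can be sketched in Python as three small steps. This is an illustration under assumptions, not TextSpresso's implementation: the function names are hypothetical, lines are assumed to be separated by CR, and plain string replacement stands in for the Pattern Insert and Strip NUL filters.

```python
import re

NUL = "\x00"

def mark(text, identifier):
    """Step 1 (Pattern Insert analog): add a NUL after each identifier."""
    return text.replace(identifier, identifier + NUL)

def extract_marked_lines(text):
    """Step 2 (Extract Lines With NUL analog): keep only lines with a NUL."""
    return "".join(re.findall(r"\r[^\r\x00]*\x00[^\r]*", text))

def strip_nul(text):
    """Step 3 (Strip NUL analog): remove the markers again."""
    return text.replace(NUL, "")

def lines_containing(text, identifier):
    return strip_nul(extract_marked_lines(mark(text, identifier)))

source = "\rdog\rcat nap\rbird\rcat and cat"
print(repr(lines_containing(source, "cat")))  # prints '\rcat nap\rcat and cat'
```

Because the marking step is a plain substring search in this sketch, it would also mark words such as "catalog"; a stricter identifier pattern would be needed to match the word cat only.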