Power-Programmierung

Text File | 1992-06-29 | 23.6 KB | 487 lines

══════════════════════════
3. DATA GATHERING
══════════════════════════

════════════════════════════
3.1 Some definitions
════════════════════════════

Terms like "data" and "information" are often used interchangeably. It would be helpful to distinguish among the following terms and propose working relationships among them:

     datum ══> data ══> record ══> information ══> knowledge

A datum is a single fact or historical observation or calculated value. In itself, a datum has little meaning; it doesn't "inform". The digit '5' is a datum, a statistic without context.

The term data is the plural of datum. "Data" is a term used in a very general way for any collection of individual facts, observations, or calculated values. (The same word is also used as a collective singular. One can say, "The data is in the report" OR "The data are in the report".)

How many kinds of data are there? As many as there are phenomena in the universe that can be observed or derived by humans. If we limit the focus to computerized data, we find even that can take ever so many forms... numbers, readable text (words, phrases, sentences, etc.), sounds, pictures or graphics, animation, video sequences, and so forth.

A record is sufficient related data to reconstruct an event. Each datum provides context for other data within the record, so that the combined total takes on meaning.

Example: The datum "5" out of context tells us virtually nothing. Look what happens when we put it within the "record" of a business transaction: listing five pairs of black Oxford shoes, style D-438, size 10-D, sold to ABC Company on October 20 at $59 per pair. This says something useful, especially to persons who created the record.

A record is treated as a single unit for search purposes. When the searcher enters attributes or words or phrases in combination, the retrieval system responds by returning each set of data (each "record") that holds those terms or is described by those attributes.
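
The idea of a record as the unit of search can be illustrated with a small sketch. It is not the retrieval method of the MIR series, only one simple way to return every record that holds all of the terms entered. The record texts, the "XYZ Stores" customer, and the function names build_index and search are assumptions made up for illustration.

```python
# Hypothetical records; the first follows the shoe-sale example in the text,
# the other two (and "XYZ Stores") are invented for illustration.
RECORDS = [
    "5 pairs black Oxford shoes style D-438 size 10-D sold to ABC Company October 20 at $59 per pair",
    "2 pairs brown loafer shoes style L-112 size 9-C sold to XYZ Stores October 21 at $45 per pair",
    "12 pairs black Oxford shoes style D-438 size 8-D sold to XYZ Stores November 2 at $59 per pair",
]


def build_index(records):
    """Map each lowercased word to the set of record numbers that contain it."""
    index = {}
    for number, text in enumerate(records):
        for word in text.lower().split():
            index.setdefault(word, set()).add(number)
    return index


def search(index, *terms):
    """Return the sorted record numbers that hold every one of the terms."""
    sets = [index.get(term.lower(), set()) for term in terms]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


INDEX = build_index(RECORDS)
print(search(INDEX, "black", "Oxford"))  # [0, 2]
print(search(INDEX, "Oxford", "ABC"))    # [0]
```

Each record comes back whole or not at all, which is what "treated as a single unit" means here. Adding a term narrows the result because the sets of record numbers are intersected.
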
Some more examples of records: a paragraph, an article in a newspaper, a screenful of text, one house in a real estate database, a Bible verse, a dictionary entry, etc.

Information is created when a person (or program) searches for, and selects, records according to a purpose. For example, the cumulative statistics combining records for all sales of black Oxford shoes, compared year to year and by region, are informative to a manufacturer who must decide on production plans for various shoe styles. Here we have a selection according to a purpose.

Merely browsing through the data does not create information. A purpose is needed to clarify how some records are to be selected and all others rejected.

Knowledge is the accumulation of information linked into useful relationships within a human mind. (As a definition, this won't satisfy the teacher in Philosophy 101, but that isn't our purpose.) In a sense, knowledge consists of mutually reinforcing sets of information. Relationship or linkage is the key.

Example: The shoe product manager puts together information on past production, current costs, financial condition of the company, status of equipment, available skilled labor, economic forecasts, market trends, analysis by sales people, and a personal awareness of the company and industry. It is this total set of linked information that forms the knowledge on which a decision will be reached.

Let's put these ideas together:

     datum + datum + datum + ....                  =  data
     enough related data to reconstruct an event   =  a record
     records selected with purpose                 =  information
     information linked in a mind                  =  knowledge

════════════════════════════
3.2 Why gather data?
════════════════════════════

People and organizations accumulate data because it is a means to create value or add to value. Note that data is a means, not an end in itself.

Data is raw material out of which information is derived. Data has value for its potential.
But it remains potential until a purpose is applied to select and group the data into useful information. And it need not be one single purpose.

Example: Tens or hundreds of thousands of copies of a large metropolitan area telephone book are distributed to the user public. Such a database may be referred to for a hundred thousand different purposes in a single day. Copied into a personal telephone index or dialed on a telephone, the data takes on the value associated with the purpose for which it was selected. The value is often small... Time is saved, or the toll charge for phoning an Information operator is avoided. Sometimes the value is beyond estimation... Ask a parent who happened to have the number of the local poison control center handy when it was desperately needed.

Data has zero value if it is not accessible. Data is a means. The value is according to purpose. If the purpose cannot be applied, there is no value. If you can't find it, it's of no use to you. If you can't find it, you cannot generate information and knowledge with it.

Accessibility is the heart of the argument in favor of records and information management, quality indexing, and simple, powerful retrieval methods. Everything in the MIR series aims to add value to data.

═════════════════════════════════
3.3 Who are data gatherers?
═════════════════════════════════

Who are data gatherers? Any person who ever recorded an observation, or collected observations made by others, qualifies as a data gatherer. If we leave the definition that broad, every civilization for which we have any recorded history had its data gatherers. Using this definition, even the early cave painters would qualify.

Let's narrow the focus somewhat. For our purposes, data gatherers are organizations or persons who put facts and observations into a form that can be manipulated by use of a computer. The data gatherers may create new data, or alternatively collect existing data.
In either case, their output is "machine readable". The data may be intended for uses that create value internally within the organization, or there may be possible profit in wider distribution or publication.

═══════════════════════════════
3.4 Keyboard data input
═══════════════════════════════

To get data into machine readable form, some form of computer software is required. Many computer users are familiar with text processing. Such programs are devices for entering, modifying, deleting and formatting text data. They are particularly useful for continuous text, such as letters and reports.

Typesetting software and desktop publishing software are variations that offer extended capabilities to prepare text for widespread distribution. These usually insert a wide variety of codes to control the format of the text. Format controls include underlining, bold text, margin sizes, paragraph indentation, centering and justifying text, font selection, type size, etc.

Another method of input, good for highly structured data, is to present the user with a template in which fields may be filled in. For example, here's part of a primitive real estate template:

     ASKING PRICE, $: ___________   MAP GRID: _____________
     HOUSE #: _____   STREET NAME: _________________________
     DISTRICT: ____________________   CITY: ___________________
     LOT SIZE (sq ft): _________   HOUSE SIZE: _____________
     NO. OF BEDROOMS: ___   FIREPLACES: ____   GARAGE UNITS: __
     IN-GROUND POOL: ___   ABOVE-GROUND POOL: __   SAUNA: __

So-called "fourth generation" programming languages are well adapted to creating and manipulating these templates. Each template may appear to have its own program. Actually, one program behind the scenes may manipulate data in a variety of templates, putting limits on the kinds of data and the value ranges that are acceptable in many of the fields.

══════════════════════════════
3.5 Scanned data input
══════════════════════════════

Many records are created by scanning devices.
Point of purchase devices interpret universal product symbols; these are increasingly common in grocery and other retail stores. Entire warehouses can be automated with the help of bar code scanners stationed along control points of conveyor belts. Movements of goods are entered as records, with exceptional accuracy and efficiency.

Not all scanning works that well. Optical scanners (which look very much like photocopiers) are used to input the text content of sheets of paper or pages of books. Scanning is only as good as the software that is used in conjunction with the scanner AND the quality of the text being scanned. Optical character recognition (OCR) has advanced dramatically with "omnifont" software that recognizes characteristics of letters as opposed to predetermined layouts.

Curiously, the quality of printed text may be deteriorating with the spread of desktop publishing. Typeset text normally leaves clear space around each character. Low cost desktop equipment may cause individual letters to run together slightly... especially double letters ('ss' in assembly). I tried scanning a 1976 and a 1991 copy of an annual publication that had switched from typesetting in 1990. The error rate was 3 per page in the 1976 typeset copy, and 103 per page in the 1991 version! (One consolation... If desktop publishing was used, somebody somewhere may have backup of the computer files; in that case, scanning is unnecessary.)

Scanning can present difficulty where the page is not a single block. Suppose the page is in three parallel columns. Can the system recognize the switch from one column to the next? Or is text horizontally in line across the columns run together as if it were continuous? Words hyphenated at column ends (and page ends) are particularly vulnerable to error.

Early in the 1990s, 99 per cent accuracy in text scanning was considered very good. That may be acceptable for small databases.
But think what a one per cent error rate means for a gigabyte of scanned information. Assuming the average word is 7.6 characters long, a gigabyte holds about 131.6 million words; if one word in a hundred is wrong, 1,316,000 words would contain errors. A good portion might be found through comparison with listings of accepted spellings. But a smudge can turn the word "leap" into the entirely different word "heap", and only the most sophisticated software has any chance of catching word substitutions of this sort. Correcting errors in very large databases is not as straightforward as in the typical letter or report; sheer size creates its own problems. (...Or opportunities! There will be more on data cleaning software in Tutorial FIVE.)

Data input is the most labor intensive part of making data accessible on computers. It is the area of greatest cost (barring outrageous royalty charges); input is an area that offers much opportunity for improvement in quality.

Here are some considerations prior to scanning a large quantity of material:

» If the work has been republished in recent years, was the text newly typeset? If yes, it may be possible to work from the typesetting tape or diskettes. Some desktop publishing systems make it easy to extract ASCII copies. Extracting text from typesetting codes is more complex, but it may be the quickest way to produce a clean copy of the text.

» Consider scanning only when there is no really usable machine readable alternative. Search out the best possible copy of the typeface which is to be scanned. The poorer the quality, the higher the error rate. Also use recent scanning software, not more than two years old.

» Set a timer as someone proofs a portion of the result. Don't expect a spell checker to provide adequate proofing; very few check the context. Correct spellings of wrong words garble the result with surprising frequency.

» If the tests above are within budget, go ahead. Otherwise seriously consider having the whole database entered at keyboards. (Sigh!)
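
The one per cent estimate earlier in this section can be reproduced with a few lines of arithmetic. The sketch below assumes a decimal gigabyte (1,000,000,000 characters, one character per byte) and reads the error rate as one word in a hundred, which is the reading that matches the figure quoted. The function name words_with_errors is made up for illustration.

```python
def words_with_errors(total_chars, avg_word_len, word_error_rate):
    """Estimate how many words contain an error.

    total_chars     -- size of the scanned text in characters
    avg_word_len    -- average characters per word (7.6 in the text)
    word_error_rate -- fraction of words assumed wrong (0.01 = one per cent)
    """
    total_words = total_chars / avg_word_len
    return total_words * word_error_rate


# Assumption: decimal gigabyte, one character per byte.
GIGABYTE = 1_000_000_000

estimate = words_with_errors(GIGABYTE, 7.6, 0.01)
print(f"{estimate:,.0f} words with errors")  # 1,315,789 words with errors
```

Rounded to the nearest thousand, the result is the 1,316,000 given above. Read instead as one character in a hundred, the same gigabyte would hold 10,000,000 wrong characters, so the choice of unit matters when comparing accuracy claims.
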
══════════════════════════════════════════════
3.6 Format, standards and common sense
══════════════════════════════════════════════

From an indexer's point of view, the ideal world would be one in which all computerized data is received in a standard format on a standard, large scale medium with a standard, publicly shared set of markup codes.

Notice a word being repeated? It comes from the experience over several years of having to figure out the most incredible variations in the way computer data is assembled.

Non-standard media? They're still around; obsolete typesetting systems are the worst offenders in producing media that other machines simply cannot read. A variation is the nine-track tape (so far, so good) that turns out to have been created by back-up software that makes the tape unreadable for any machine not using the same operating system.

Wrong scale media? Consider the friendly customer who provides 200 million characters of data on floppy diskettes, 360,000 bytes at a time. At the other extreme, I handled another database of 2.3 billion characters on good nine-track tape; hours were wasted because 2.0 billion characters were blank padding in empty fields.

Then there was the neatly formatted hierarchical text database, beautifully ready in every detail but one. The paragraphs were set 90 characters wide. Since the target machine had the standard presentation width of 80 characters, it was back to the drawing board!

Why this small digression? Obviously it makes me feel good. Far, far more important... the failure to use standards costs the information industry and the end customer bundles of money. Standards save money! The use of standards and common sense greatly increases the accuracy of cost and time requirement forecasts. Jobs get done on time. The customer is well served.

One last thought along this line: The searcher does not need to know the intricacies of a particular standard.
What is important is that technical staff accept the responsibility to ensure standards are applied. For example, increasingly there are advantages to using some form of SGML (Standard Generalized Markup Language). It permits the end user control over the way data is presented for viewing on the screen or the way it is printed. The results are pleasing, particularly on computers that allow changes in print size and character fonts. (Again, the results are even more pleasing to the wallet.)

════════════════════════
3.7 Data quality
════════════════════════

The potential value of data increases with accuracy. The single best protection against errors is neither accuracy checks nor precise verification methods. It's people who care. If there are trained workers with pride of workmanship who are permitted reasonable time to ensure quality, then quality has a fair chance.

There is an attitude, all too common among managers, that data entry is a menial job to be done in the cheapest possible way. They get what they pay for... cheap performance. The real cost is borne by the searcher later on. Data entry errors lead to missed records, incomplete search results, and frustration.

Some data input systems make accuracy easier. Template based software often includes data type and range checks for each field; this stops many errors at their source. Word processing packages have spell checkers which catch all but word substitution errors. These too should be used as part of the daily entry routine.

Quality problems with Optical Character Recognition (OCR) equipment and software were mentioned earlier. Visual checking by a human is the only effective way to ensure validity of scanned numeric data or of words in isolation.

Error checking of continuous text can be automated up to a point. But comparison to lists of correctly spelled words is not enough. Some kind of check of nearby vocabulary is needed to catch word substitutions.
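
A short sketch shows why a list of correctly spelled words is not enough. It assumes a tiny made-up word list (ACCEPTED) and a hypothetical helper, unknown_words; a real checker would load a full dictionary. The two test phrases reuse examples from earlier in this tutorial: the run-together double letter in "assembly" and the smudge that turns "leap" into "heap".

```python
import re

# Hypothetical word list; a real checker would load a full dictionary.
ACCEPTED = {"the", "frog", "will", "leap", "heap", "over", "assembly", "line"}


def unknown_words(text, accepted=ACCEPTED):
    """Return the words in text that are absent from the accepted list."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in accepted]


# A dropped letter yields a non-word, which the list comparison catches.
print(unknown_words("the asembly line"))         # ['asembly']

# A substitution yields a real word, so nothing is flagged.
print(unknown_words("the frog will heap over"))  # []
```

The second phrase passes untouched because "heap" is a correctly spelled word. Only a check that looks at the neighboring words, or a human proofreader, has a chance of noticing that it does not belong.
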
Since intelligent context checking software is not all that common, the cost of validating scanned input may turn out to be higher than that of the original scanning.

Timeliness is another aspect of quality. Have you ever kept receiving mail for the previous occupants of your home, five years after they have moved away? Mailing lists and many other forms of data are vulnerable to obsolescence. Again, the cost of errors is felt, not by the data gatherer, but by the user.

Consistency is another quality issue that arises in text data that has been accumulated over an extended period of time. There may have been changes in the software used to enter the data. The change may be only in successive revisions of the software, so there may be reasonable consistency over time. But complete changeovers to different software packages do occur. In gigabyte size databases, the resulting inconsistencies may lie buried until an attempt is made to prepare the data for indexing and search. If so, expect unpredictable and undesirable results.

Data quality can be summed up in terms of the willingness of the data gatherer to accept costs to ensure accuracy, timeliness, and consistency.

Be ready to ask some tough questions of organizations providing data for you:

     How was the data gathered?
     Where was it entered into a computer, and under what conditions?
     Were the keypunchers working in their first language?
     What incentives for accuracy were given to keypunchers and to their supervisors?
     What measures were in place to ensure prompt, accurate updates as data changed?

═════════════════════════
3.8 Value of data
═════════════════════════

Recall that people and organizations accumulate data because it is a means to create value or add to value.

The primary marketing question in data gathering is: For whom? Who will gain by having the data available? What are the characteristics of persons or groups who are most likely to be able to create value using this data?
Is a record worth creating in the first place? We don't know, apart from awareness of its potential use. Does the data have inherent worth? Any response is idle speculation, apart from awareness of who is the potential user.

The wise data gatherer addresses marketing questions early in planning any new project. The way the data gatherer plans has a direct bearing on the quality and cost of use by the searcher (the end customer). Here is a series of marketing decisions that affect the searcher directly.

Market capacity: If there are lots of people who already have a felt need for the data being offered, who have the computer equipment and money available, volume pricing might be used right from the start. If the data is specialized, and of interest to relatively few potential users, the market capacity is lower. In this case, expect a "cherry picking" strategy... top price at first to reach those most eager, then successive moderate price drops to broaden the customer base.

Cost recovery strategy: How much was the investment in research and development for this project, and how quickly must those funds be recovered? If competition is threatening, these costs must be covered quickly. This boosts initial prices. But expect more dramatic reductions over time. Alternatively, overpricing for fast cost recovery may simply kill the market's interest in the product. This happened with great regularity in the early days of the CD-ROM industry. The standing joke was that the only companies making money on CD-ROMs were the mail couriers.

Educating the market: If enough prospects recognize the potential value of the data, marketing and sales costs can be held to a moderate level. If on the other hand there must be heavy investment in communicating product benefits and in customer hand holding, these costs must be loaded into the price.

Perception of value: It is easy to kill interest in a product by underpricing. "Oh, if it's only that much, it can't be very good."
An effective marketing technique (perfected in the cosmetic industry and carried over into information products) is to build a mystique and sense of prestige around use of the product. The other end of the scale is give-away pricing... setting the information product price low or literally free in order to move associated products (usually computer hardware).

Value added through combination: A database may attract limited interest in its own right. But combined with other data, whole new applications open up. A telephone book alone is useful for looking up individual addresses and phone numbers. Add mailing codes, type of dwelling, years in that residence, and demographics (relative rankings for small clusters of dwellings for income level, numbers of children, numbers of retired people, etc.) and then the combination proves potent for creating targeted mailing lists.

══════════════════════════
3.9 Data ownership
══════════════════════════

Data gatherers have an understandable interest in getting paid for their work. Public opinion on the matter has been rather casual.

Copyright, at least in theory, provides protection for intellectual property. In reality, losses through illicit copying are substantial. The difficulty is that computer data is so very easily copied. Anti-piracy software and encryption of data offer partial protection; what they really do is raise the cost of illegal use high enough to discourage all but the most ardent computer hacker.

Publishing media such as CD-ROM raise the cost by the sheer volume of data. Who would want to copy 600 megabytes onto hard disk?

The worst nightmare for the data gatherer is the offshore commercial pirate who produces forged product and introduces it into the domestic market at lower prices.

═══════════════════
3.10 Summary
═══════════════════

Collecting and entering data into a computer is the first stage in enabling people to find information quickly and easily in a gigabyte world.
Data is raw material, selected according to a searcher's purposes, to create useful information.

Data takes a variety of forms. Text data, that is, any data that can be entered through a keyboard, can be prepared for search much more readily than graphic or sound data.

Methods of input directly affect the quality of data, and hence its potential value for the searcher.

Use of standard media and data formatting dramatically lowers the costs of preparation for search.

Marketing issues affect the cost and ultimately the quality of data products that are available for search.