- Newsgroups: comp.databases.ingres
- Path: sparky!uunet!elroy.jpl.nasa.gov!sdd.hp.com!saimiri.primate.wisc.edu!aplcen.apl.jhu.edu!news
- From: mae@aplpy.jhuapl.edu (Mary Anne Espenshade)
- Subject: Copying data into INGRES
- Message-ID: <1993Jan22.220936.27618@aplcen.apl.jhu.edu>
- Sender: news@aplcen.apl.jhu.edu (USENET News System)
- Organization: Johns Hopkins University Applied Physics Lab
- Date: Fri, 22 Jan 93 22:09:36 GMT
- Lines: 117
-
- I have an interesting problem to deal with involving getting data into
- an INGRES database. Explaining it may take a while though, and
- requires a good bit of UNIX background as well as INGRES copy knowledge;
- everyone else can bail out now...
-
- I administer (and am writing a Windows 4GL query interface to) a
- database that is updated quarterly from an outside source. Each update
- tape is a complete replacement for all the data. Each file on the tape
- is a single table. I'm running on an HP UNIX system, but the provider
- of the tape is apparently on VMS, because the only 9 track tape formats
- we can get are VMS formatted or unlabeled. I get the unlabeled tape and
- have a UNIX shell script to read it in using dd. Since there are nearly
- 300 tables in this database, that means nearly 300 unnamed files on this
- tape, so the script has to keep track of where it is so I can identify
- which file will go into which table.
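-
- A minimal sketch of that loop, with hypothetical names (TAPE, table.list),
- not the actual script: each dd call reads up to the next tape mark, so
- the Nth unnamed file on the tape lands in the load file for the Nth
- table in the list.

```shell
# Hypothetical sketch of the tape-reading loop.  TAPE would be the
# no-rewind tape device (e.g. /dev/rmt/0mn), so each dd call consumes
# exactly one unnamed file and leaves the tape positioned at the next.
TAPE=${TAPE:-/dev/rmt/0mn}
read_tape() {
    while read table; do
        # one dd per file; the table name comes from the list, in order
        dd if="$TAPE" of="$table.dat" bs=20b 2>/dev/null
    done
}
# usage: read_tape < table.list
```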
-
- The data used to be in fixed length fields with no delimiters between
- them and a carriage return/linefeed pair at the end of each record. My
- script would read a file from the tape, pass it through a tr command
- to replace the two line delimiters with a single newline, and name
- the file appropriately so I could then read it with an SQL
- copy script. I had nearly 10,000 lines of copy scripts, with the fixed
- length format for each field of each table and matching
- "with null(' ...many spaces... ')" clauses where needed. I could load
- the entire database with a makefile in about 3 hours.
-
- Not too surprisingly, the organization making these tapes decided to
- switch to a variable-length format. Lots of empty 240-character
- comment fields padded with blanks waste a lot of tape space. My
- problem is the particular format they came up with. Now, each record
- starts with a four digit record length, not followed by a delimiter,
- and then the fields of the record, separated by commas, with each
- character field within double quotes. There is still a carriage
- return/linefeed pair at the end of each record.
-
- I saw two ways to tackle this beast: rewrite my 10,000 lines of copy
- commands to decode the possible multiple delimiters around each field
- individually, or rewrite my tape reading script to process the files into
- a format copy could deal with more easily. I didn't want to face the
- magnitude of the changes to copy and thought I saw a way a few passes
- through sed and/or tr could fix the files. Comma was a really
- unfortunate choice for the field delimiter, as there are many commas in
- the text fields as well, and how they got double quotes instead of
- single out of an SQL-based system (not Ingres, I think it's maintained
- on either Oracle or Sybase) at the other end I have no idea. I verified
- that there were no tabs in the data and decided to make that the field
- delimiter. A few global changes converted the copy scripts to all have
- the format: (dummy = d4, /* to dump the 4 digit record length */
- .... = c(0)tab,
- .... = c(0)tab with null(''),
- .... = c(0)nl)
-
- Sample record as input (each file has a different number and order of
- fields of course, except that the first and last fields are always
- character strings):
- 1234"key field","text",10,20,"more text","still, more, text"(CR)(LF)
- What I wanted:
- 1234key field(tab)text(tab)10(tab)20(tab)more text(tab)still, more, text(NL)
-
- I ended up with a shell script that makes 10 passes over each file:
- 1) the original "tr -s" to replace the CR/LF pairs with NL
- 1234"key field","text",10,20,"more text","still, more, text"(NL)
- 2) remove the first double quote
- 1234key field","text",10,20,"more text","still, more, text"(NL)
- 3) remove the last double quote
- 1234key field","text",10,20,"more text","still, more, text(NL)
- 4) replace any ,", with ^A", (the ^A is a placeholder for a real text
- comma, some of the continued character fields can have data ending
- with a comma. I missed this the first time around and had to redo
- the script. Apparently the source DBMS has a maximum character
- field length of 240 characters, so there are tables with fields
- description1, description2 and description3, each varchar(240).)
- 1234key field","text",10,20,"more text","still, more, text(NL)
- 5) replace any "," with a tab
- 1234key field(tab)text",10,20,"more text(tab)still, more, text(NL)
- 6) replace any ," with a tab
- 1234key field(tab)text",10,20(tab)more text(tab)still, more, text(NL)
- 7) replace any ", with a tab
- 1234key field(tab)text(tab)10,20(tab)more text(tab)still, more, text(NL)
- 8) replace any ,SP with a ^ASP (this is the problematic assumption)
- 1234key field(tab)text(tab)10,20(tab)more text(tab)still^A more^A text(NL)
- 9) replace any remaining commas with tabs
- 1234key field(tab)text(tab)10(tab)20(tab)more text(tab)still^A more^A text(NL)
- 10) replace the ^A place-holders with commas.
- 1234key field(tab)text(tab)10(tab)20(tab)more text(tab)still, more, text(NL)
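-
- The ten passes can be strung into a single pipeline; this is a sketch,
- not the actual script, and it assumes GNU tr/sed (for the \t and \x01
- escapes; the ^A place-holder is written as \x01):

```shell
# The ten passes as one pipeline; each stage mirrors one numbered step.
clean() {
    tr -s '\r\n' '\n'      |  #  1) CR/LF pair -> single NL
    sed 's/"//'            |  #  2) remove the first double quote
    sed 's/\(.*\)"/\1/'    |  #  3) remove the last double quote
    sed 's/,",/\x01",/g'   |  #  4) protect a field-final comma: ,", -> ^A",
    sed 's/","/\t/g'       |  #  5) "," -> tab
    sed 's/,"/\t/g'        |  #  6) ,"  -> tab
    sed 's/",/\t/g'        |  #  7) ",  -> tab
    sed 's/, /\x01 /g'     |  #  8) comma-space -> ^A-space
    tr ',' '\t'            |  #  9) remaining commas -> tabs
    tr '\001' ','             # 10) ^A place-holders back to commas
}
# feeding the sample record above through clean yields the wanted line
```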
-
- Needless to say, on the one huge table with 96,000+ records, this takes
- a while.
-
- As noted on step #8, I had made a hopeful assumption that turned out to
- be wrong. I thought all the commas in the text fields would be in
- sentence-structured text, with a space after each comma. This was true
- for the comment and description type fields (with the exception that
- added step #4) but was not true for several shorter character fields,
- where lists of values had been crammed in with commas between them but
- no spaces. The really irksome thing is that this data has major
- consistency problems, so a field might have a value expressed as 'a/b/c'
- most of the time, but it sometimes shows up as 'a,b,c' or 'a-b-c'.
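-
- The bad case is easy to reproduce with steps 8-10 in isolation (a
- sketch; the protect name is made up): a value packed as still,more,text
- has no comma-space pairs for step 8 to protect, so step 9 eats its
- commas.

```shell
# Steps 8-10 alone: only ", " survives as a comma; a bare "," inside a
# field turns into a tab, i.e. a spurious extra field.
protect() {
    sed 's/, /\x01 /g' |   #  8) protect comma-space (GNU sed escape)
    tr ',' '\t'        |   #  9) remaining commas -> tabs
    tr '\001' ','          # 10) restore the protected commas
}
# printf 'still, more, text\n' | protect   -> unchanged
# printf 'still,more,text\n'   | protect   -> still<tab>more<tab>text
```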
-
- The first few files I loaded didn't have too many problems, and since
- copy tells me which lines it can't read and I have them coded
- "with on_error=continue, rollback=disabled", I thought I could go back
- and fix the offending lines individually and reload them. Then I hit
- the file with 11765 records, 8588 of which failed. I need another
- way to process this tape and I don't see what it is. Is there another
- way to use copy that I'm not seeing, short of making extra dummy fields
- to get rid of all the extra delimiters? Or do I really have to edit
- 9876 lines of SQL and change 80% of them to individual formats?
-
- To anyone who actually made it through this description, sorry for
- being so long-winded. :-) Thanks for any help/ideas.
-
- e-mail replies to:
- --
- Mary Anne Espenshade
- ARPA: mae@aplpy.jhuapl.edu or mae@aplexus.jhuapl.edu
- UUCP: ...!allegra!mimsy!aplcen!aplpy!mae
-