- Newsgroups: comp.databases.ingres
- Path: sparky!uunet!elroy.jpl.nasa.gov!sdd.hp.com!saimiri.primate.wisc.edu!aplcen.apl.jhu.edu!news
- From: mae@aplpy.jhuapl.edu (Mary Anne Espenshade)
- Subject: Copying data into INGRES
- Message-ID: <1993Jan22.220936.27618@aplcen.apl.jhu.edu>
- Sender: news@aplcen.apl.jhu.edu (USENET News System)
- Organization: Johns Hopkins University Applied Physics Lab
- Date: Fri, 22 Jan 93 22:09:36 GMT
- Lines: 117
-
- I have an interesting problem to deal with involving getting data into
- an INGRES database. Explaining it may take a while though, and
- requires a good bit of UNIX background as well as INGRES copy knowledge;
- everyone else can bail out now...
-
- I administer (and am writing a Windows 4GL query interface to) a
- database that is updated quarterly from an outside source. Each update
- tape is a complete replacement for all the data. Each file on the tape
- is a single table. I'm running on an HP UNIX system, but the provider
- of the tape is apparently on VMS, because the only 9 track tape formats
- we can get are VMS formatted or unlabeled. I get the unlabeled tape and
- have a UNIX shell script to read it in using dd. Since there are nearly
- 300 tables in this database, that means nearly 300 unnamed files on this
- tape, so the script has to keep track of where it is so I can identify
- which file will go into which table.
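-
- A minimal sketch of that loop, with hypothetical names (TAPE, table.list),
- not the actual script: each dd call reads up to the next tape mark, so
- the Nth unnamed file on the tape lands in the load file for the Nth
- table in the list.

```shell
# Hypothetical sketch of the tape-reading loop.  TAPE would be the
# no-rewind tape device (e.g. /dev/rmt/0mn), so each dd call consumes
# exactly one unnamed file and leaves the tape positioned at the next.
TAPE=${TAPE:-/dev/rmt/0mn}
read_tape() {
    while read table; do
        # one dd per file; the table name comes from the list, in order
        dd if="$TAPE" of="$table.dat" bs=20b 2>/dev/null
    done
}
# usage: read_tape < table.list
```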
-
- The data used to be in fixed length fields with no delimiters between
- them and a carriage return/linefeed pair at the end of each record. My
- script would read a file from the tape, pass it through a tr command
- to replace the two line delimiters with a single newline, and name
- the file appropriately so I could then read it with an SQL
- copy script. I had nearly 10,000 lines of copy scripts, with the fixed
- length format for each field of each table and matching
- "with null(' ...many spaces... ')" clauses where needed. I could load
- the entire database with a makefile in about 3 hours.
-
- Not too surprisingly, the organization making these tapes decided to
- switch to a variable-length format. Lots of empty 240-character
- comment fields padded with blanks waste a lot of tape space. My
- problem is the particular format they came up with. Now, each record
- starts with a four digit record length, not followed by a delimiter,
- and then the fields of the record, separated by commas, with each
- character field within double quotes. There is still a carriage
- return/linefeed pair at the end of each record.
-
- I saw two ways to tackle this beast: rewrite my 10,000 lines of copy
- commands to decode the possible multiple delimiters around each field
- individually, or rewrite my tape reading script to process the files into
- a format copy could deal with more easily. I didn't want to face the
- magnitude of the changes to copy and thought I saw a way a few passes
- through sed and/or tr could fix the files. Comma was a really
- unfortunate choice for the field delimiter, as there are many commas in
- the text fields as well, and how they got double quotes instead of
- single out of an SQL-based system (not Ingres, I think it's maintained
- on either Oracle or Sybase) at the other end I have no idea. I verified
- that there were no tabs in the data and decided to make that the field
- delimiter. A few global changes converted the copy scripts to all have
- the format: (dummy = d4, /* to dump the 4 digit record length */
- .... = c(0)tab,
- .... = c(0)tab with null(''),
- .... = c(0)nl)
-
- Sample record as input (each file has a different number and order of
- fields of course, except that the first and last fields are always
- character strings):
- 1234"key field","text",10,20,"more text","still, more, text"(CR)(LF)
- What I wanted:
- 1234key field(tab)text(tab)10(tab)20(tab)more text(tab)still, more, text(NL)
-
- I ended up with a shell script that makes 10 passes over each file:
- 1) the original "tr -s" to replace the CR/LF pairs with NL
- 1234"key field","text",10,20,"more text","still, more, text"(NL)
- 2) remove the first double quote
- 1234key field","text",10,20,"more text","still, more, text"(NL)
- 3) remove the last double quote
- 1234key field","text",10,20,"more text","still, more, text(NL)
- 4) replace any ,", with ^A", (the ^A is a placeholder for a real text
- comma, some of the continued character fields can have data ending
- with a comma. I missed this the first time around and had to redo
- the script. Apparently the source DBMS has a maximum character
- field length of 240 characters, so there are tables with fields
- description1, description2 and description3, each varchar(240).)
- 1234key field","text",10,20,"more text","still, more, text(NL)
- 5) replace any "," with a tab
- 1234key field(tab)text",10,20,"more text(tab)still, more, text(NL)
- 6) replace any ," with a tab
- 1234key field(tab)text",10,20(tab)more text(tab)still, more, text(NL)
- 7) replace any ", with a tab
- 1234key field(tab)text(tab)10,20(tab)more text(tab)still, more, text(NL)
- 8) replace any ,SP with a ^ASP (this is the problematic assumption)
- 1234key field(tab)text(tab)10,20(tab)more text(tab)still^A more^A text(NL)
- 9) replace any remaining commas with tabs
- 1234key field(tab)text(tab)10(tab)20(tab)more text(tab)still^A more^A text(NL)
- 10) replace the ^A place-holders with commas.
- 1234key field(tab)text(tab)10(tab)20(tab)more text(tab)still, more, text(NL)
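-
- The ten passes can be strung into a single pipeline; this is a sketch,
- not the actual script, and it assumes GNU tr/sed (for the \t and \x01
- escapes; the ^A place-holder is written as \x01):

```shell
# The ten passes as one pipeline; each stage mirrors one numbered step.
clean() {
    tr -s '\r\n' '\n'      |  #  1) CR/LF pair -> single NL
    sed 's/"//'            |  #  2) remove the first double quote
    sed 's/\(.*\)"/\1/'    |  #  3) remove the last double quote
    sed 's/,",/\x01",/g'   |  #  4) protect a field-final comma: ,", -> ^A",
    sed 's/","/\t/g'       |  #  5) "," -> tab
    sed 's/,"/\t/g'        |  #  6) ,"  -> tab
    sed 's/",/\t/g'        |  #  7) ",  -> tab
    sed 's/, /\x01 /g'     |  #  8) comma-space -> ^A-space
    tr ',' '\t'            |  #  9) remaining commas -> tabs
    tr '\001' ','             # 10) ^A place-holders back to commas
}
# feeding the sample record above through clean yields the wanted line
```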
-
- Needless to say, on the one huge table with 96,000+ records, this takes
- a while.
-
- As noted on step #8, I had made a hopeful assumption that turned out to
- be wrong. I thought all the commas in the text fields would be in
- sentence-structured text, with a space after each comma. This was true
- for the comment and description type fields (with the exception that
- added step #4) but was not true for several shorter character fields,
- where lists of values had been crammed in with commas between them but
- no spaces. The really irksome thing is that this data has major
- consistency problems, so a field might have a value expressed as 'a/b/c'
- most of the time, but it sometimes shows up as 'a,b,c' or 'a-b-c'.
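-
- The bad case is easy to reproduce with steps 8-10 in isolation (a
- sketch; the protect name is made up): a value packed as still,more,text
- has no comma-space pairs for step 8 to protect, so step 9 eats its
- commas.

```shell
# Steps 8-10 alone: only ", " survives as a comma; a bare "," inside a
# field turns into a tab, i.e. a spurious extra field.
protect() {
    sed 's/, /\x01 /g' |   #  8) protect comma-space (GNU sed escape)
    tr ',' '\t'        |   #  9) remaining commas -> tabs
    tr '\001' ','          # 10) restore the protected commas
}
# printf 'still, more, text\n' | protect   -> unchanged
# printf 'still,more,text\n'   | protect   -> still<tab>more<tab>text
```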
-
- The first few files I loaded didn't have too many problems, and since
- copy tells me which lines it can't read and I have them coded
- "with on_error=continue, rollback=disabled", I thought I could go back
- and fix the offending lines individually and reload them. Then I hit
- the file with 11765 records, 8588 of which failed. I need another
- way to process this tape and I don't see what it is. Is there another
- way to use copy that I'm not seeing, short of making extra dummy fields
- to get rid of all the extra delimiters? Or do I really have to edit
- 9876 lines of SQL and change 80% of them to individual formats?
-
- To anyone who actually made it through this description, sorry for
- being so long-winded. :-) Thanks for any help/ideas.
-
- e-mail replies to:
- --
- Mary Anne Espenshade
- ARPA: mae@aplpy.jhuapl.edu or mae@aplexus.jhuapl.edu
- UUCP: ...!allegra!mimsy!aplcen!aplpy!mae
-