RISC World

Impression to HTML

Dave Holden describes his Impression to HTML converter




This program was originally written to enable me to convert long multi-chapter manuals for programs written in Impression into multi-file HTML documents complete with illustrations and an Index file with links to appropriate places in the text. Other people have a similar requirement and have probably found, like me, that the various other tools available to do this just don't do a very good job and require too much editing and alteration to the HTML afterwards.

Companion programs

There is a program to do the same for Ovation Pro. This is functioning perfectly but still needs some work to make it 'user friendly'. It will be published in the next edition of RISC World.

There is also HTML-DTP which does the opposite, that is, it takes an HTML file or series of files and converts them into a form suitable for loading into Impression or Ovation Pro.

Finally there is HTM_link. This is used to insert image tags and links into HTML files. It makes the job very easy and avoids those silly mistakes we all make when 'hand coding' HTML.

What will it convert?

This program does not attempt to produce a DTP-like HTML file. This is simply not appropriate. HTML is not really a suitable medium for complex layout structures. The original aim was to convert program manuals, where the layout is fairly straightforward, and this it does very well.

Imp-HTML is not suited to 'fancy' documents with multi-column text running around lots of pictures or where formatting and layout are of prime importance. This can be done in HTML with frames, tables, specifying fonts and font sizes, etc. and some programs, especially on PCs, try to do this. What you usually end up with is a complicated, bloated HTML structure that will look lovely in its target browser and platform (usually Internet Explorer on a PC) and take ages to render (if it renders at all) in other browsers on other platforms and often look nothing like the original anyway.

Briefly, if you have something with more pictures than text, a complex layout and which requires specific fonts or font sizes then this program will not be able to reproduce the document in HTML. If it's a book-like document which is mainly text with or without illustrations then Imp-HTML will almost certainly do it.

How Imp-HTML works

The program does not use the original Impression document file. Instead it works with text files saved from Impression. If you use 'Save text story' and ensure that 'With styles' is ticked then Impression will save the complete text story with all its Styles and effects, so headings, bold and italicised text, etc. can be correctly interpreted.

The disadvantage of this system is that all graphics are lost. However, in practice this isn't really a problem. With a printed manual the writer will often include a lot of graphics that are not absolutely necessary simply because it's easy to do so. Too many graphics in an HTML document can make it slow to download or render and, because you don't have such exact control over layout, may make it look very messy.

So, by separating the conversion of the textual part of the document from the graphics you have precise control over what graphics are inserted, where, in what format and whether you wish to apply a scaling factor. For example, with a booklet it may be best to position a graphic slightly before or after the text which refers to it to suit page layout. As the reader will have the pages open before them when reading it won't matter if they have to look across at a facing page to see the picture. With HTML, if the picture is in the same place relative to the text, this might well place it completely out of sight when reading the document on screen.

What needs to be done

There are several main operations that must be performed when converting an Impression Text Story to HTML:

  • Any 'special' characters such as the pound sign, < and >, etc. must be converted to their appropriate HTML equivalents.
  • Bold, italic and centred text must have appropriate tags inserted.
  • The various styles used in the Impression documents need to be 'mapped' to suitable HTML 'styles' so that headings, sub headings, etc. are given appropriate HTML equivalents.
  • An 'index' must be created which provides links to selected headings and sub headings, and the file must be broken up into chapters if required.

Some of these operations will be the same for almost all Impression files. For example, character translation and bold and italic effects will be the same in any document. Others, such as mapping Styles to HTML, when to start a new chapter and which items should appear in the index file may be different for each Impression document.

All of these parameters are set in external text files within !Imp-HTML so they are not fixed. To cope with different Impression files you can have many different mapping files (the files which define which Impression Styles are mapped to which HTML tags and which appear in the Index). These can be selected from a menu in Imp-HTML which also includes tools to make creating these files very simple.

Using the program

Before you can convert an Impression file you need an appropriate Mappings file. A 'basic' file named 'DefaultMap' is provided and there is also one called 'Manual' which was designed to work with the format used to write this manual. As this TextStory is provided with the program you can use it with this mapping file to make an HTML version of the manual to demonstrate how the program works.

Load Imp-HTML by double-clicking on the application icon and it will install itself on the icon bar. If you click MENU on this icon a menu will appear with the usual 'Info' and 'Quit' items at top and bottom and three others, 'Process', 'Mappings' and 'Save Choices'. More about the others later, but for the moment select 'Process' and the window below will open. You can also open this window by clicking SELECT on the iconbar icon.

At this stage there is a 'greyed out' directory icon on the right hand side. Drag the Impression TextStory to the icon at the top of the window labelled 'Source file'. Its name will appear there, the directory icon will become 'active' and the instruction 'Enter name and drag dir. icon to convert file' will appear at the bottom.

Output file options

The 'box' at the bottom left of the window defines the form the HTML file(s) will take. There are three options:

  • Separate chapters will divide the HTML into chapters at predetermined points with a separate index file.
  • File + Index will create two files, one called BODY which contains the main document and a separate INDEX file.
  • Single file makes a single long HTML file with an index at the start of the file.

The first two methods create a directory in which the files are placed. The name shown in the 'Save as' icon is the name of this directory. If there are any graphics files to be included these can then be placed in this directory or a sub-directory.

The third method, 'Single file', creates a single HTML file so there's no need for a directory. In this case the name in the Save as icon will be the name of the HTML file and the icon will change from a directory to an HTML file icon.

Chapter numbers

When you split your file up into separate chapters they are given names with a numeric suffix. These would normally be in the form CHAP01, CHAP02, CHAP03 etc. If the original file also has chapter numbers and you want to arrange for the file names to correspond with them they might be out of step if there is something such as a Preface or Introduction before 'Chapter 1'.

This option enables you to adjust the numbers given to the HTML 'chapter' files. By clicking on the 'bump' icons you can make the numbers begin with '00' or even negative numbers. If your document does have a Preface, that would then become CHAP00 and Chapter 1 of the document would correspond with CHAP01.

Adjust headings

When you use a Style to set a heading or sub-heading in Impression the code to set the style is normally placed at the start of the line but the code to end it will usually be placed at the start of the following line. When this is converted to HTML it will (usually) look OK in a browser but it produces messy code.

If this option is ticked then the program will try to adjust this by moving the code to cancel the style to the end of the preceding line. This produces 'cleaner' HTML which makes it much easier if you wish to edit or adjust it manually later.
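
The adjustment can be pictured with a short Python sketch. This is only an illustration of the idea, not the program's own code: the function name adjust_headings and the assumption that the stray code has already become a closing <H?> tag at the start of a line are mine.

```python
import re

# Assumption: the misplaced 'style off' code has already been turned into
# a closing heading tag (</H1> to </H6>) at the very start of a line.
CLOSE_TAG = re.compile(r"^(</H[1-6]>)", re.IGNORECASE)


def adjust_headings(lines):
    """Move a leading closing heading tag to the end of the preceding line."""
    out = []
    for line in lines:
        match = CLOSE_TAG.match(line)
        if match and out:
            out[-1] += match.group(1)   # append the tag to the previous line
            line = line[match.end():]   # and drop it from this one
        out.append(line)
    return out


messy = ["<H2>Using the program", "</H2>Before you can convert a file..."]
tidy = adjust_headings(messy)
print("\n".join(tidy))
```

With the sample above the heading and its closing tag end up on the same line, which is the 'cleaner' form described.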

Capitalise files

If this is ticked then filenames will be in upper case, CHAP01, CHAP02, etc. If it isn't ticked the filenames will be in lower case, chap01, chap02, etc.

In either case all HTML links in the index file will be in the same case.

This is largely cosmetic as RISC OS and Windows filing systems aren't case sensitive. However, Linux and Unix are, so if, for example, you were to transfer files to a case sensitive system (and remember that most web servers are Unix based) via a DOS format disc then it would be best to make sure that all the links are upper case.

.HTM extension

If this is ticked then all files will be given a .HTM file extension (or .htm if 'Capitalise files' is not selected). This would be normal unless you are sure that your files will only be used on a RISC OS system.
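
To show how the last three options combine, here is a minimal Python sketch of the naming scheme. The function chapter_filename and its parameter names are hypothetical, and the form of names with negative numbers is not described above, so the sketch only covers numbers from zero upwards.

```python
def chapter_filename(position, first_chapter=1, capitalise=True, htm_extension=True):
    """Build a chapter file name such as CHAP01.HTM.

    position      - zero-based position of the chapter in the output
    first_chapter - number given to the first chapter file ('1st chapter')
    capitalise    - state of the 'Capitalise files' option
    htm_extension - state of the '.HTM extension' option
    """
    number = first_chapter + position
    name = f"CHAP{number:02d}"
    if htm_extension:
        name += ".HTM"
    return name if capitalise else name.lower()


# A document with a Preface: start at 0 so that Chapter 1 becomes CHAP01.
for position in range(3):
    print(chapter_filename(position, first_chapter=0))
```

Because the same function would be used for both the files and the links in the index, the two can never disagree about case.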

Auto UL

If this is ticked then the program will try to create <UL> constructs.

If it finds a line starting with a 'bullet' character (ascii 143) it will begin a <UL> construct and each bullet will be translated into a <LI>. It will close the construct by inserting a </UL> when it finds a line that doesn't begin with a bullet. If there is a Tab character following the bullet, as is normally done in Impression, this will be ignored.

This system is not infallible, but it does normally work quite well. The most common problem is if you have one of the 'bullet sections' divided into two paragraphs. This will result in the HTML being terminated and re-started. However, you will normally be able to see this in the finished product and correct it manually.
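
The rule is simple enough to sketch in a few lines of Python. This is an illustration under my own assumptions (the function name auto_ul, and text held as a list of lines), not the program's actual code.

```python
BULLET = chr(143)  # the 'bullet' character mentioned above


def auto_ul(lines):
    """Wrap runs of lines that start with a bullet in <UL> ... </UL>."""
    out = []
    in_list = False
    for line in lines:
        if line.startswith(BULLET):
            if not in_list:
                out.append("<UL>")
                in_list = True
            item = line[1:]
            if item.startswith("\t"):   # ignore the Tab after the bullet
                item = item[1:]
            out.append("<LI>" + item)
        else:
            if in_list:                 # first non-bullet line ends the list
                out.append("</UL>")
                in_list = False
            out.append(line)
    if in_list:
        out.append("</UL>")
    return out


text = [BULLET + "\tFirst point", "Second paragraph of the first point", BULLET + "\tSecond point"]
print("\n".join(auto_ul(text)))
```

The sample deliberately includes a bullet section split into two paragraphs, so the output shows the list being terminated and re-started as described.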

Choosing a Mapping file

If you click on the menu icon to the right of the icon labelled 'Mapping' a menu will open listing all the Mapping files available. With the basic program there will be only two, the 'DefaultMap' and 'Manual', which is provided to enable you to experiment with the text of this manual. To choose a Mapping file just select it from the menu and it will be loaded and its name will appear in the icon.

A demonstration

Open the 'Process' window and drag the 'TextStory' file supplied with this program, which is the text of this manual, to the icon at the top of the window. Select the 'Manual' Mapping file and set 'Adjust headings' and 'Auto UL' so they are ticked. Select 'Separate chapters' and set '1st chapter' to '1'.

Now drag the directory icon to a suitable filer window.

The icon at the bottom of the window will inform you as the program creates each chapter and will say 'Finished' when it is done. At this point the directory icon will once again be 'greyed out'.

Unless you have changed the name in the Save as icon a directory called HTML will have been created and in it will be several HTML files. Their exact names will depend on the state of the 'Capitalise files' and 'HTM extension' icons, but there will be an Index file and a file for each chapter.

If you load the Index file into a browser you should find that it contains a series of HTML links. The 'major' ones will be to each individual chapter. The start of each 'chapter' has nothing to do with Chapters in Impression. They are defined (in this case) by the chapter headings, such as the 'Using the program' heading at the start of this section. The 'minor' links are to the sub headings, such as 'A demonstration' above.

If you click on any of these links the browser will load the appropriate file and, in the case of the minor links, scroll to the sub-heading.
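
The structure of such an index can be sketched in Python as follows. The exact HTML that Imp-HTML writes is not reproduced here: the function build_index, the anchor names and the use of &nbsp; to indent minor links are all assumptions made for the illustration.

```python
from html import escape


def build_index(entries):
    """Build index lines from (title, filename, anchor) tuples.

    anchor is None for a 'major' link to a whole chapter file, or a
    (hypothetical) anchor name for a 'minor' link to a sub heading.
    """
    lines = []
    for title, filename, anchor in entries:
        if anchor is None:
            lines.append(f'<A HREF="{filename}">{escape(title)}</A><BR>')
        else:
            lines.append(f'&nbsp;&nbsp;<A HREF="{filename}#{anchor}">{escape(title)}</A><BR>')
    return "\n".join(lines)


print(build_index([
    ("Using the program", "CHAP02.HTM", None),
    ("A demonstration", "CHAP02.HTM", "s1"),
]))
```

A minor link names both the file and a position within it, which is why the browser loads the chapter and then scrolls to the sub heading.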

You could experiment with the other options. If you select 'Single file' you will find that the program creates a single, long, HTML file with the index at the beginning.

Saving Choices

If you select 'Save choices' from the iconbar menu then the settings in the lower part of the window will be saved in a Choices file. The next time you run Imp-HTML they will then be set the way you have chosen.

Editing a Mapping file

This is actually the most complex part of the program.

If you click ADJUST on the icon bar icon or select 'Mappings' from the menu the mappings window will open.

There's a lot going on in this window. You will see that it's divided into three main areas.

At the top left are the 'Effects'. This is where you set the HTML codes that will be used for Bold, Italic, Underline, Centre and Tab.

At the top right is the section which enables you to load existing Mappings files for modification and to save files.

The main part of the window is where Impression Styles are given the characteristics that will be used in HTML documents.

Altering the definitions

You could create a completely new Mapping file but it will be easier to begin by looking at a completed one. Click on the menu icon labelled 'Edit file' and a menu will appear with the names of all the available Mapping files. At present this will just show 'DefaultMap' and 'Manual'. Select 'Manual' and the file will be loaded and its definitions will appear in the window and its name will be shown in the icon under the 'Edit file' label. The 'Effects' section of the window should now look like this.

Editing the Effects

The top four items, Bold, Italic, Underline and Centre, each have an icon where the HTML code used to switch the effect on and off is shown. You can enter these effects either by typing them in directly or by selecting from the menu that will appear if you click on the corresponding menu icon.

The top item on the menu is 'Clear', and if you click on this the icons for that item will be cleared ready for you to enter or select a new code.

The last item is 'Newline'. This doesn't put a <BR> code in the HTML to break the line, it puts a '\n' code in the Effect definition which just places a newline character (ascii 10) in the file so the source file will have the line split at that point. This is done just to make the source file easier to read and edit. For example, it is used with the 'centre' code to make centred lines more obvious.

The last item, Tab, obviously requires only one icon. Again there is a menu but it's much simpler. As there's no direct equivalent to the Tab character in HTML and multiple spaces are concatenated to a single space the best option if you must have some sort of Tab is multiple &nbsp; codes (non-breaking spaces).

The final two buttons in this window are 'Clear' and 'Default'.

'Clear' will clear all the Effects icons ready for you to select or type new codes.

'Default' will insert a series of default codes as shown.

Editing the Styles

The Styles section of the window is shown below.

The icon labelled 'Style name' shows the name of the Style as it appears on Impression's Style menu. The other two 'white' icons show the HTML codes that will be substituted for the 'on' and 'off' commands for these codes. These icons, like those for the effects, are writable, so they can be entered (or added to) manually or selected from a menu. If you click on the menu icon on the right the menu shown will appear, and selecting anything from this menu adds the appropriate 'on' and 'off' codes to the contents of the icons.

Instead of describing the actions of all the icons I'll describe each Style in the example above and explain how it is used and what effect the various buttons have on it.

First there is Main Heading. In the original Impression document this is centred 28pt bold, and is used at the start of each main section or 'chapter'. So, <H2> has been chosen, and it is also centred. Because the use of this Style signifies the start of a new chapter the icon on the extreme right, labelled 'NC' (for New Chapter) is ticked. When creating multi-chapter HTML this will begin a new chapter file whenever this Style is found. Styles where 'NC' is ticked will also automatically appear in the Index file so there's no need to tick this as well.

'Sub Heading' is used in the original document for the headings to each section, so I've chosen <H4> as an appropriate tag.

The three buttons labelled 'p', 'br' and 'none' define what (if any) tags are used at the end of each line, or, more accurately, each time a linefeed (ascii 10) character is found. As Impression exports text unformatted this will only appear if you pressed RETURN when writing the text, such as at the end of a paragraph.

'p' means that each paragraph is wrapped in <P> and </P> tags. This is what you would use for the main body of your text. This is 'on' by default, so it will be applied to the one Style not shown on the menu but which exists in all Impression documents, namely 'Normal'. As with Impression, this is simply the default text used in the browser. As it happens, since this is a fairly simple Mappings file, all the other styles don't use <P> and </P>, so none of them have this option set.

'br' means use <BR> at the end of a line instead of <P> and </P> so there will be no gap between paragraphs as is otherwise the case. This is used for the Style NoGap. In the original Impression document there is normally a 3pt gap between paragraphs to separate them. NoGap is the same as the main body text but without this gap and is used where I don't want a gap between lines when I press RETURN. In the HTML mapping it has no definition other than having 'br' selected, so it's the same as the main body text but will have <BR> at the ends of lines so there won't be a gap between paragraphs, just as in the original document.

'None' means exactly that. There is no <BR> or </P> tag at the ends of lines. This is applied to the two Heading Styles because these are used only for single lines (Main and Sub headings) and in HTML an <H?> tag always begins a new line and a new line is begun after an </H?> tag so there's no need.

The Style Code would be used in the original document where I wanted to enter something like example code and is defined with a monospaced font (actually Corpus) and no gap between lines. This is therefore converted to <PRE> tags in HTML with no tags at the end of lines, as all browsers will begin a new line in <PRE> mode when they find a linefeed but many will add another when they find <BR>, so if <BR> were used the text would all be double-spaced.
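
The effect of the three end of line buttons can be summarised in a small Python sketch. It is an illustration only: the function name end_of_line and the idea of passing the text as a list of paragraphs (one per RETURN in the original) are assumptions.

```python
def end_of_line(paragraphs, mode):
    """Apply the 'p', 'br' or 'none' end of line mapping."""
    if mode == "p":
        # Each paragraph is wrapped, so the browser leaves a gap between them.
        return [f"<P>{text}</P>" for text in paragraphs]
    if mode == "br":
        # Lines are broken but no gap is left between them.
        return [f"{text}<BR>" for text in paragraphs]
    if mode == "none":
        # Nothing is added, as for headings or text inside <PRE>.
        return list(paragraphs)
    raise ValueError(f"unknown end of line mode: {mode}")


sample = ["First line", "Second line"]
for mode in ("p", "br", "none"):
    print(mode, end_of_line(sample, mode))
```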

More on the Styles menu

This menu is 'user configurable' (see later) so you can have whatever codes you prefer on it and omit those you don't use to make it more manageable.

The 'Fontsize +1' and 'Fontsize +2' items can be used, often in combination with a Bold tag, to give larger than normal text where you don't want to use an <H?> tag. As the tags can be edited after you've selected them it's easy to alter this to '+3' or even '-1', so you don't need a menu entry for everything you might want to use.

'nbsp' inserts a &nbsp; tag but only in the 'on' section. This is a simple way of indenting a sub heading by a small amount.

'Line' inserts a <HR> tag into the 'on' section. This is useful in long HTML documents, especially if you are not breaking it up into chapters as it can be used to provide a 'visual' break between sections.

Saving a Mappings file

To save a mappings file just click on 'Save'. Unless this is a new file (see next section) you may want to change its name first. If you try to save a file and one of the same name already exists you will be warned first.

When you save a Mappings file only those Styles where the icon at the extreme left is ticked will be included. This lets you omit styles which have no appropriate HTML equivalent. Imp-HTML will simply ignore any styles that aren't included in the Mappings file so they'll appear in the base style.

Creating a Mappings file

In the previous section we loaded and examined a pre-existing Mappings file. If, like me, you tend to use variations on a few standard layouts for your documents this is a good way of working. However, at some point you will need to create a Mappings file from scratch.

Scanning the TextStory

Open the Mappings window and drag a TextStory saved (with Styles) from Impression to it. Once again I'll use the TextStory of this manual as an example.

The window will clear any previous definitions and, after Imp-HTML has scanned the file looking for all Style definitions, the Styles section should look like this.

On the left are shown all the Styles that Imp-HTML found. You will see that there's one more than appeared in the previous example, Contents. This is because the Style was present in the original document but I didn't bother to include it in the Mappings file as it is only used in a part of the document that is not going to be converted to HTML, namely the Contents.

By default all Styles are 'selected' for inclusion in the Mappings file (it's easier to re-load and remove a Style you've included by mistake than add one you've forgotten to tick) and the 'p' switch is set.

The 'Effects' at the top left are all blank. You can either enter specific codes or just click on 'Default'. Later I shall explain how you can set the defaults for Effects.

Setting the Styles

This is done as previously described. One thing to note is that if you select H1 or H2 (such as for the Main Heading) you will see that 'none' and 'NC' also become selected. Similarly if you choose H4 for something (such as Sub Heading) you will see that 'none' and 'I' are automatically set. This is because these options are included as part of the definitions for these items.

Saving the file

Saving is done by just clicking on 'Save'. A default filename of 'NewFile' will have been placed in the filename icon when you start a new Mappings file, and you would normally alter this first. You should also remember to 'un-tick' any Styles that won't have HTML equivalents first.

If you now click on the menu icon you should see your new file listed.

Style 'culling'

The Mapping window can accommodate only 14 Styles to map to HTML tags. This would normally be more than enough, since you wouldn't often create more than 14 distinctive Styles in an Impression document. However, it is possible, especially where you have a number of variations on the base Style with different rulers or where your standard document layout includes a lot of Styles, only some of which are used in the document to be converted.

If there are more than 14 Styles in a document then when you drag the TextStory to Imp-HTML it will open a window showing all the Styles in the TextStory.

Select the Styles you want to use and leave the ones that either aren't used in the document or which cannot be given any meaningful 'translation' into HTML unselected, then click on 'OK'. The selected Styles will be transferred to the Mapping window, the Abridge Styles window will close and you can proceed exactly as previously described.

Appendix

This section contains information on the various files used 'internally' by Imp-HTML and explains how you can alter these to suit your own requirements.

Setup files

These can be found in the sub directory 'Setup' of the Imp-HTML application directory. In here you will find five files:

  CharMap
  D_Effects
  Effects
  Preamble
  Styles

CharMap

This is a list of ascii characters and the characters or strings of characters they will be converted to in HTML. Each line has the character and then a Tab character (ascii 9) followed by the HTML equivalent. If you load this file into a text editor it should look something like this.

The appearance of the Tab character, shown as a right pointing arrow here, will depend upon which text editor you use.

If you are familiar with HTML you will recognise these. For example, the 'sexed' double quotation marks are translated into &quot;, the pound character into &pound;, the < and > characters into &lt; and &gt;, and so on. You can modify or add to this list if you wish.
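
As a rough illustration of how a file in this format could be read and applied, here is a Python sketch. The names parse_charmap and translate are hypothetical, and the sample holds only three of the entries mentioned above, not the full contents of the real CharMap file.

```python
def parse_charmap(text):
    """Read 'character, Tab, HTML equivalent' lines into a dictionary."""
    mapping = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue                      # skip blank or malformed lines
        char, replacement = line.split("\t", 1)
        if len(char) == 1:
            mapping[char] = replacement
    return mapping


def translate(text, mapping):
    """Replace each mapped character, one character at a time."""
    return "".join(mapping.get(ch, ch) for ch in text)


SAMPLE = "<\t&lt;\n>\t&gt;\n\u00a3\t&pound;\n"   # \u00a3 is the pound sign

charmap = parse_charmap(SAMPLE)
print(translate("Costs < \u00a35 and > \u00a31", charmap))
```

Working one character at a time means that the text of a replacement is never itself translated a second time.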

D_Effects

This is the list of Effects that will be used if you click on 'Default' in the Effects section of the Mapping window. It consists of four lines, one each for Bold, Italic, Underline and Centre, in that order. On each line is, first, the tag used to switch the effect on, then a Tab character, then the tag used to cancel the effect.

If you examine this file in a text editor you will see that there is a space after the 'cancel' tags for each of the first three. This is because when you double-click on a word in Impression to highlight it and then make the word bold, italic or underlined, if the word is followed by a space Impression also includes the space in the Effect. For example, if you make a word bold in this way in Impression then the 'cancel bold' command will actually be placed after the following space. For some strange reason in HTML spaces before a 'cancel' are ignored, so -

  This is Bold and this isn't.

will become, in HTML

  This is Boldand this isn't.

To avoid this a space is placed after the 'cancel' tag. This can sometimes mean that you get a spurious space, for example, between a word and a full stop or comma, but even if you don't bother to weed these out they are much less common and less offensive to the eye than the 'joined up' words illustrated above.
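
The following Python sketch shows the file format and the effect of the extra space. It makes several assumptions: the tags in SAMPLE are plausible defaults rather than a copy of the real D_Effects file, and the markers [bold on] and [bold off] are invented placeholders standing in for whatever codes Impression actually exports.

```python
EFFECT_NAMES = ["Bold", "Italic", "Underline", "Centre"]  # order given above


def parse_default_effects(text):
    """Read four 'on tag, Tab, off tag' lines into a dictionary."""
    effects = {}
    for name, line in zip(EFFECT_NAMES, text.splitlines()):
        on, off = line.split("\t", 1)
        effects[name] = (on, off)       # trailing space in 'off' is kept
    return effects


# Assumed contents: note the space after the first three 'cancel' tags.
SAMPLE = "<B>\t</B> \n<I>\t</I> \n<U>\t</U> \n<CENTER>\t</CENTER>\n"

effects = parse_default_effects(SAMPLE)
bold_on, bold_off = effects["Bold"]

# The 'cancel bold' code comes after the space that follows the word.
exported = "This is [bold on]Bold [bold off]and this isn't."
html = exported.replace("[bold on]", bold_on).replace("[bold off]", bold_off)
print(html)
```

The space inside the bold section may be dropped by the browser, but the one supplied after the 'cancel' tag keeps the words apart.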

Effects

This defines the Effects menu.

Each line in the file creates a line on the menu. First there is the name (no more than 12 characters) as it will appear on the menu. This is followed by a Tab character, then the code to turn the effect on, another Tab character, and the code to turn the effect off. The illustration shows the default file and menu and you can see how the two relate.

You can, of course, edit or add to this file to suit your own preferences.

Preamble

This will be used as the first part of any HTML created. It can contain anything you like, as long as it's valid HTML. The default Preamble file is shown below, and this is extremely basic.

Styles

This creates the Styles menu. Normally this would be a bit longer and more complicated than the Effects file, but it follows the same pattern. On each line is first the name as it will appear on the menu (no more than 12 characters), a Tab character, then the 'on' code, another Tab character, then the 'off' code.

The standard file, and the menu, are shown below.

You will see one obvious difference from the Effects file. Some of the first entries have another Tab character and field after the 'off' code. This is used to force the settings of the buttons that set the end of line mapping and whether the Style starts a new HTML chapter or appears in the Index. Five letters are recognised, and these can be on their own or paired. They must be in upper case, and are -

  •   I - Include anything in this Style in the Index file
  •   C - Start a new Chapter when you find this Style
  •   N - Set the end of line button to 'none'
  •   B - Set the end of line button to 'br'
  •   P - Set the end of line button to 'p'

The 'Colour Red' entry is really only included as an example of how you can use this menu to make your headings and sub headings coloured if you wish.
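
A line from the Styles file could be decoded along these lines. This Python sketch is an illustration only: parse_style_line is a hypothetical name and the menu name 'Heading 2' in the example is made up, although the 'NC' field matches the behaviour described earlier for <H2>.

```python
END_OF_LINE = {"N": "none", "B": "br", "P": "p"}


def parse_style_line(line):
    """Split 'name, Tab, on, Tab, off[, Tab, letters]' into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    name, on, off = fields[:3]
    flags = fields[3] if len(fields) > 3 else ""   # optional fourth field
    end_of_line = None                             # None: leave button alone
    for letter, mode in END_OF_LINE.items():
        if letter in flags:
            end_of_line = mode
    return {
        "name": name,
        "on": on,
        "off": off,
        "index": "I" in flags,         # appears in the Index file
        "new_chapter": "C" in flags,   # starts a new chapter file
        "end_of_line": end_of_line,
    }


print(parse_style_line("Heading 2\t<H2>\t</H2>\tNC"))
print(parse_style_line("Line\t<HR>\t"))
```

An entry without the extra field, such as 'Line', simply leaves the buttons as they were.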

!Imp-HTML is Copyright © David Holden 2002.
