home *** CD-ROM | disk | FTP | other *** search
- BFF: A grammar for Binary File Formats
-
- With a growing number of binary formats that are being used, there is a
- need for specifying these formats in a well-defined way. Context free
- grammars have been used to specify the syntax of programming langauges. To
- use a grammar for binary file formats seems to be a logical choice.
-
- In this page such a grammar, named BFF, is described. It has several
- construct that are not traditionally found in context free grammars for
- programming language. Due to the nature of binary file formats, it is
- important to be able to reference information that has been read before.
- For example, a string of characters might be preceded by a number that
- indicates the lengt of the string.
-
- The terminal symbols of the grammar consist of a number of bytes,
- representing one of the basic data types, such as: char, short int, long
- int, float and double. Differences in byte ordering for integers, and the
- different formats for floating point numbers should be taken into account.
-
- Due to the nature of binary formats, it is not too restrictive to use only
- recursive descent grammars, e.i., grammars that can be parsed top-down, and
- belong to the LL(1) class.
-
- The BFF is tested for the DWG file format. As we start with this file
- format, BFF will naturally first focus on the requirements based on this
- format, and because of this it will be slightly biased.
-
- To specify the grammar of BFF we will use a form of extended BNF.
-
- Tool support for BFF
-
- The first tool I am thinking about is a program that can read the grammar
- and apply this to a given binary file, resulting in an annotated output.
-
- With this tool should also support the reverse engineering of binary file
- formats.
-
- In a later state, a tool could be made that generates a parser and the
- needed data structures for reading a binary file into memory according to
- the grammar. As it is not always required (nor possible) to read the whole
- file into memory, it should be possible to generate procedures to read the
- file interactively.
-
- The form of the BFF grammar
-
- A grammar that describes a Binary File Format consists of the specification
- of the elementary units of data, and the rules by which these should be
- grouped together.
-
- The elementary units
-
- We assume that a Binary File can be viewed as a stream of bytes (as this is
- the most commonly used unit of data). Usually a number of bytes are grouped
- together to form data values that cannot be represented by a single byte.
- To specify a word value consisting of two bytes, for example, we propose
- the following defintion style:
-
- type word :=
- byte : first,
- byte : second
- return ((word)first | ((word)second << 8)).
-
- A word representation where the lower order byte comes before the higher
- order is usually used by small Endian processors. The expression used on
- will be based on C. We assume that the following types have been defined on
- top of the default types of C:
-
- typedef unsigned char byte;
- typedef unsigned short int word;
- typedef unsigned long int longword;
-
- This leeds us to the definition of the basic types that will be supported:
-
- C_data_types :=
- "char" | "byte" |
- "short" | "word" |
- "long" | "longword" |
- "float" | "double" .
-
- (We assume for the moment that float and double represent floating numbers
- of 4, respectively 10 bytes.) The grammar of the rule used for defining
- types is:
-
- type_def_rule :=
- "typedef" C_data_type basic_type_name ":="
- ("byte" ":" byte_name) LIST
- "return" expr "." .
-
- Here expr stands for C-like expression using the byte_names as they are
- used in the rule. The basic_type_names should not be confused with the
- C_type_names. It is possible that the same name is both used as a
- basic_type_name and a C_type_name.
-
- The rules
-
- The grammar that specifies in which order elementary units are taken from a
- binary file, makes use of non-terminal symbols and rules for each
- non-terminal symbol. There will be one non-terminal symbol that will parse
- the whole binary file, which will be called the root non-terminal symbol.
- For each non-terminal symbol there has to be a rule describing the elements
- it consists of, where each element is either an elementary elements or a
- non-terminal symbols. The rule of the root non-terminal symbol comes as the
- first rule, and is preceded with the word `root'. The whole BFF grammar
- follows the following grammar:
-
- BFF_grammar :=
- type_def_rule SEQ
- "root" rule
- rule SEQ.
-
- Each rule has a non-terminal symbol on the left-hand side, and a list of
- elements on the right-hand side. Each element is either a elementary
- element, a non-terminal symbol, or a grouping of elements. Because BFF
- assumes a top-down parsing method, it is possible to give each non-terminal
- symbol a number of parameters. This leads to the following grammar for the
- rules:
-
- rule :=
- non_term_name ( "(" param LIST ")" )OPT
- ":=" elem LIST ".".
-
- Each element consist of the following parts:
-
- * An (optional) range, which specifies the range of the file that
- element may read.
- * A data type, which can either be a elementary unit, a non-terminal
- symbol, or a list of elements enclosed by brackets.
- * An (optional) times expression, to indicate that the element can be
- repeated, either for a given number of times or for an unknown number
- of times.
- * An (optional) identifying name, which can be used later to reference
- the value found.
- * An (optional) equivalence expression, which can be used for checking.
-
- The following grammar describes an element:
-
- elem := range OPT
- data_type
- ( "[" expr "]"
- | "*" )
- ( ":" elem_name )OPT
- ( "=" C_expr )OPT.
-
- range := "[" file_pos (":" file_pos)OPT "]".
-
- file_pos := "begin" | "end" | "cur" | expr.
-
- data_type := "(" elem LIST ")"
- | basic_type_name
- | non_term_name ( "(" expr LIST ")" )OPT.
-
- (To be continued ....)
-
- ---------------------------------------------------------------------------
- Last update: Tuesday, 16-Jan-96 20:42:48 MET