The intention of this page is to provide some explanation
of the gedcom parser, to aid development on and with it. Currently,
the parser is in a state that it works, but some parts are still missing,
notably the interface towards applications. First, some practical issues
of testing with the parser will be explained.
Basic testing
The parser is located in the "gedcom" subdirectory of the Genes source code.
You should be able to perform a basic test using the commands:
make clean
make
make test
If everything goes OK, you'll see that some gedcom files are parsed, and
that each parse is successful. Note that the used gedcom files are made
by
Heiner Eichmann
and are an excellent way to test gedcom parsers thoroughly.
Preparing for further testing
The basic testing described above doesn't show anything else than "Parse
succeeded", which is nice, but not very interesting. Some more detailed
tests are possible, via the
gedcom-parse
program that is generated
by
make test
.
However, since the output that
gedcom-parse
generates is in
UTF-8 format (more on this later), some preparation is necessary to have a
full view on it. Basically, you need a terminal that understands and
can display UTF-8 encoded characters, and you need to proper fonts installed
to display them. I'll give some advice on this here, based on the Red
Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 4.0.x. Any
other distribution that has the same or newer versions for these components
should give the same results.
For the first issue, the UTF-8 capable terminal, the safest bet is to use
xterm
in its unicode mode (which is supported by the
xterm
coming with XFree86 4.0.x). UTF-8 capabilities have only
recently been added to
gnome-terminal
, so probably that is not
in your distribution yet (it certainly isn't in Red Hat 7.1).
For the second issue, you'll need the ISO 10646-1 fonts. These come
also with XFree86 4.0.x.
The way to start
xterm
in unicode mode is then e.g. (put everything
on 1 line !):
LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm
-fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
This first sets the
LANG
variable to a locale that uses
UTF-8, and then starts
xterm
with a proper Unicode font. Some
sample UTF-8 plain text files can be found
here
. Just
cat
them on the command line and see the result.
Testing the parser with debugging
Given the UTF-8 capable terminal, you can now let the
gedcom-parse
program print the values that it parses. An example of a command line
is (in the
gedcom
directory):
./gedcom_parse -dg t/ulhc.ged
The
-dg
option instructs the parser to show its own debug messages
(see
./gedcom_parse -h
for the full set of options). If
everything is OK, you'll see the values from the gedcom file, containing
a lot of special characters.
For the ANSEL test file (
t/ansel.ged
), you have to set the environment
variable
GCONV_PATH
to the
ansel
subdirectory of
the gedcom directory:
export GCONV_PATH=./ansel
./gedcom_parse -dg t/ansel.ged
This is because for the ANSEL character set an extra module is needed for
the iconv library (more on this later). But again, this should show
a lot of special characters.
Testing the lexers separately
The lexers themselves can be tested separately. For the 1-byte lexer
(i.e. supporting the encodings with 1 byte per characters, such as ASCII,
ANSI and ANSEL), the sequence of commands would be:
make clean
make test_1byte
cat t/allged.ged | ./test_1byte
This will show all tokens in the
t/allged.ged
test file. With
the lexers you have to make sure that you use the proper lexer for each test
file. The
test_1byte
test program is OK for
allged.ged
and
ansel.ged
(the last one again with the environment variable
set); for the
uhl*.ged
files you need the
test_hilo
test program; for the
ulh*.ged
files you need the
test_lohi
program.
This concludes the testing setup. Now for some explanations...
Structure of the parser
I see the structure of a program using the gedcom parser as follows:
TO BE COMPLETED...