Robust Action
see Instructions
To use pajek with the datafiles read by the flor.exe
program (last updated 5/26/99), it will be helpful to read
Analyzing Large Kinship and Marriage Networks
by drw, Vladimir Batagelj and Andrej Mrvar. 1999 (forthcoming).
Social Science Computer Review. You will need a password from drwhite@uci.edu if you do not remember it.
Final Version
The final version of the algorithm is completed and a preliminary analysis done.
The Alberti graph is partitioned out of the whole ball of wax (50,000 nodes).
Just looking at the top 4 generations of the lowermost big family segment, the program found
29 of 30 links in the genealogy (97% accuracy) and the database was missing only 4 significant links in the genealogy.
One case was not matched because of a spelling variant that has now been corrected (Attaviano=>Ottaviano).
Only in one case (Antonio Doffo Alberti) would we have made a match for a name missing a marriage date
where the tax year would have made the match correctly. I hesitate to do this for now because it might
introduce more errors.
Pajek will now read the ENTIRE GRAPH of 57,505 nodes and partitions for families are
made automatically by adding their names to the fam-list file. Single families or their unions can be
examined. Graphs such as Alberti by decade are now a matter of course
for any of the 1,000 or more families. The vertical axis is the actual date of marriage by decade, starting in 1200.
The color coding gives the "accuracy" measure, the number of potential fathers found (green=1, red=2, blue=3).
The total graph has 86,187 lines, 28,680 arcs, 30,959 components,
and 2,134 independent marriage cycles, of which 2,127 (99.6%) are in the giant component
(#59 in the partition) of 19,030 vertices and 21,426 arcs. Of the total of 2,134 independent
marriage cycles, 2,113 (98.6%) are in the giant bicomponent of 5,653 vertices and 7,765 arcs. This is a very hefty bicomponent.
The remaining cycles are generally created by blood marriages outside the bicomponent. Of the vertices
in the bicomponent:
3,568 have degree 2
1,027 have degree 3
538 have degree 4
250 have degree 5
125 have degree 6
68 have degree 7
40 have degree 8
16 have degree 9
14 have degree 10
4 have degree 11
2 have degree 12
1 has degree 13
4 are at generational depth 1 (latest)
60 are at depth 2
207 at depth 3
512 at depth 4
766 at depth 5
852 at depth 6
797 at depth 7
707 at depth 8
572 at depth 9
383 at depth 10
306 at depth 11
204 at depth 12
127 at depth 13
91 at depth 14
44 at depth 15
16 at depth 16
4 at depth 17 (earliest)
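The totals quoted above can be cross-checked with a little arithmetic. The sketch below assumes (the text does not say so) that the count of independent cycles is the usual cyclomatic number, arcs minus vertices plus components, and that each arc adds one to the degree of two vertices. The function and variable names are illustrative only.

```python
# Degree distribution of the giant bicomponent, copied from the list above:
# degree -> number of vertices with that degree.
degree_counts = {
    2: 3568, 3: 1027, 4: 538, 5: 250, 6: 125, 7: 68,
    8: 40, 9: 16, 10: 14, 11: 4, 12: 2, 13: 1,
}


def cyclomatic_number(n_vertices, n_arcs, n_components):
    """Number of independent cycles: arcs - vertices + components."""
    return n_arcs - n_vertices + n_components


# Vertices are the sum of the counts; arcs are half the sum of the degrees.
bicomp_vertices = sum(degree_counts.values())
bicomp_arcs = sum(d * c for d, c in degree_counts.items()) // 2

# The bicomponent is a single connected piece, hence one component.
bicomp_cycles = cyclomatic_number(bicomp_vertices, bicomp_arcs, 1)

# Whole graph, using the totals quoted in the text.
total_cycles = cyclomatic_number(57505, 28680, 30959)

print(bicomp_vertices, bicomp_arcs, bicomp_cycles, total_cycles)
```

With the figures quoted above, the degree list reproduces the 5,653 vertices and 7,765 arcs of the bicomponent, and the formula gives 2,113 cycles for the bicomponent and 2,134 for the whole graph.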
Instructions
I sent the *.dat and the fam-list files (you can edit the latter) by email.
Run the program using options 1,1,1.
The program will construct the master best.net file and a series of *.clu and *.cls files for each family that you add
to the list in fam-list. Right now there are alberti and medici followed by stop, and the rest of the names are ignored.
Option 1,1,1 will allow you to construct separate graphs for each family and option 1,1,2 will concatenate all the families in
your list (you use the fam1nam.clu, fam2acc.clu, fam3dec.clu in this case for the instructions below). You don't need to use
option 1,2; it was just for program development with a reduced database.
Run Pajek and select best.net.
Then read the family partition such as albe1nam.clu. Click Partitions/First Partition
then Partitions/Second Partition and then Partitions/Extract Second from First. Save this NAMES partition, e.g., as Albe1nam411.clu
where 411 is the number of nodes in the NAMES reduction (including the 7 timeline nodes if you add dates).
Then read albe2acc.clu. Click Partitions/First Partition
then Partitions/Extract Second from First. Save this partition, e.g., as Albe2acc411.clu
where 411 is the number of nodes in the ACCURACY reduction.
Then read albe3dec.clu. Click Partitions/First Partition
then Partitions/Extract Second from First. Save this partition, e.g., as Albe3dec411.clu
where 411 is the number of nodes in the DECADES reduction.
albe1nam.clu
Now again read (1) the total network best.net and (2) the family partition albe1nam.clu.
To extract the alberti subnetwork, click
Operations/Extract from Network/Partition [from 1 to 1 is automatic: the partition is 0/1 where the 1s are the Alberti nodes].
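For readers who want to see what this extraction does outside Pajek, here is a minimal Python sketch. It assumes the usual plain-text layout of a Pajek partition file (a `*Vertices n` header followed by one cluster number per vertex, in vertex order). The function names and the tiny example data are hypothetical and are not output of flor.exe.

```python
def read_clu(text):
    """Parse the text of a Pajek-style .clu partition into a list of ints.

    Entry i of the result is the cluster number of vertex i + 1.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[1])          # header line: "*Vertices n"
    return [int(v) for v in lines[1:1 + n]]


def extract_cluster(arcs, partition, cluster=1):
    """Keep the vertices in `cluster` and the arcs that join two of them."""
    keep = {i + 1 for i, c in enumerate(partition) if c == cluster}
    kept_arcs = [(u, v) for u, v in arcs if u in keep and v in keep]
    return keep, kept_arcs


# Hypothetical four-vertex example: vertices 1, 3 and 4 belong to the family.
clu_text = "*Vertices 4\n1\n0\n1\n1\n"
arcs = [(1, 2), (1, 3), (3, 4), (2, 4)]

partition = read_clu(clu_text)
family_vertices, family_arcs = extract_cluster(arcs, partition)
print(sorted(family_vertices), family_arcs)
```

The sketch keeps the original vertex numbers so that the result is easy to compare with the full network by eye.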
To construct the graph with decades read the DECADES partition albe3dec411.clu that you saved. Now click Draw/Draw
Partition. In the Draw screen click Layers/in y direction followed by Layers/Optimize layers in x direction/Forward.
Your graph will appear. If you did this correctly and you chose the 1 option at the start that adds timelines, then the seven nodes to the right
of the image will be the timeline nodes with appropriate edge labels for the periods. To see the timeline and firstname labels click
Options/Lines/Mark lines. To see the last name labels (which clutters the page)
click Options/Mark vertices using/Labels. To see the difference between male and female lines then click
Options/Mark vertices using/No Labels.
Thus you can do a graph for ANY family or ANY COMBINED SET of families (like the Alberti-Medici graph) by adding to the fam-list and options 1,1,1 or 1,1,2. However,
a better way of doing this is just to run the FLOR.EXE program once, and later run FLOR-ADD.EXE
giving the first four letters of each
family name you want to concatenate. If you pick albe and medi for example, then the albemedi2acc.clu and albemedi3dec.clu
files will be created for you as partitions, and you can extract in this way the union of any set of families (the output file names
will always be those of the first two families).
For checking the data family by family, print relevant sections of the "scratch" file created by the program,
in which each family is taken up
alphabetically and within each family the sibling sets are taken up alphabetically by the name of the father. The format
will become evident if you stare at it for a while. The last pair of numbers on the right are masterID numbers for a couple
corresponding to the individual, and the parents of that individual. These numbers are repeated in the "Spss" file which can be inserted
directly as variables in the master file but some of the numbers will refer to constructed parents. There is no funny stuff anymore with
constructed parents. All assumptions are STRICTLY CONSERVATIVE. If there is anything funny in the scratch file let me know immediately.
To see the size of the bicomponent (which can generate a 0/1 partition that can go into the MASTER file for cross tabulation with other
variables) the quickest way is to read the total network best.net and click Partition/Components/Bi-Components ... and choose a large
minimum size such as 100 or 1000. Then click Hierarchy/Extract Cluster (you can always choose cluster 1 when there is only one giant
bicomponent). You will be amazed at how quickly the bicomponent is computed (in a few seconds). Then click Operations/Extract from Network/Cluster
and the total network will reduce to 5726. This is too big to draw all at once, I believe, but it can become an important analytic variable, and its
relinking index is an important fact. You can also reduce the total network for a specific range of decades, and then recalculate the bi-component
for that specific historical era. This opens the way to my kind of analysis. Have fun!
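To make the bicomponent step concrete, here is a small Python sketch of the textbook depth-first (Hopcroft-Tarjan style) method for biconnected components, with a minimum-size filter like the one chosen in the Pajek dialog. It is not Pajek's own code. The function name and the toy graph are hypothetical, and the recursive form is only suitable for small examples, not for the full best.net.

```python
from collections import defaultdict


def biconnected_components(edges):
    """Return the vertex sets of the biconnected components of an
    undirected graph given as a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    disc, low = {}, {}        # discovery time and low-link per vertex
    stack, comps = [], []     # stack of tree/back edges, finished components
    counter = [0]

    def dfs(u, parent):
        counter[0] += 1
        disc[u] = low[u] = counter[0]
        for v in adj[u]:
            if v == parent:
                continue
            if v not in disc:                 # tree edge
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:         # u separates v's subtree
                    comp = set()
                    while True:
                        e = stack.pop()
                        comp.update(e)
                        if e == (u, v):
                            break
                    comps.append(comp)
            elif disc[v] < disc[u]:           # back edge to an ancestor
                stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for s in list(adj):
        if s not in disc:
            dfs(s, None)
    return comps


# Toy graph: triangle 1-2-3, a bridge 3-4, and triangle 4-5-6.
toy = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4)]
min_size = 3   # analogous to choosing a minimum size in the dialog
large = sorted(sorted(c) for c in biconnected_components(toy)
               if len(c) >= min_size)
print(large)
```

In the toy graph the bridge 3-4 is a two-vertex component and is dropped by the size filter, which is the same effect as choosing a large minimum size to isolate the giant bicomponent.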
Name Matching Notes: Algorithm tested for Alberti
14albberjac - perfect match of father except for wife's name (computer: sandra; genealogy: Lisa)
75% predicted with 100% accuracy for Alberti genealogy I-VII (missing VIII)
15% probably right 67% later than genealogy
10% probable 60% from alternatives given dates of marriages
total accuracy probably 91% but rated by certain / likely / best guess
some misses due to longer fa-son lag in marriage dates: 48,60,61,64,66 years, often due to second marriages
Second Test of First Algorithm
Includes a numeric quality rank for predicting potential matches
TRUE ANCESTORS occur at the following rank frequencies:
57 were rank 1 (predicted perfectly)
1 tied for quality rank 1
5 tied for inferior rank 1 (both some possibility)
7 rank 2
2 rank 3
1 rank 4
1 rank 5
2 tied among six with inferior rank
2 not found at all, although six with inferior rank
PREDICTIONS NOT IN THE GENEALOGY FALL AS FOLLOWS
55 have only one possible match
6 have 2 possible matches
1 has 5 possible matches
1 has 6 possible matches
1 has 8 possible matches
LOOKED AT FROM VIEWPOINT OF HOW MANY POSSIBLE CHOICES:
ONE: 100% in 75 predictions, 30 verified
TWO: 67% in 17 predictions, 8/12 verified
THREE: 59% in 17 predictions, 10/17 verified
FOUR+: 40% in 23 predictions, 6/15 verified
FOR PAJEK FILES, stripping away the arcs from nodes with more than
1, 2, 3 etc. parents of a given type will successively improve the accuracy of the genealogy
OVERALL ACCURACY IF YOU TAKE ALL PREDICTIONS is 90% with no contradictions in genealogy
among those found in the genealogies, the confirmed hit rate is 73%
almost 100% of the ancestors in the genealogies who do have offspring are found by the
algorithm as possible alternatives, even if not the top prediction
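The percentages above can be recomputed from the verified counts in the breakdown by number of possible choices. The short sketch below does that arithmetic. It assumes that the ONE row means 30 of 30 checked predictions were verified (the text gives "100%" and "30 verified"); the variable names are illustrative.

```python
# (verified, checked against the genealogy) per number of possible choices.
verified = {
    "ONE": (30, 30),     # assumption: 100% of the 30 checked predictions
    "TWO": (8, 12),
    "THREE": (10, 17),
    "FOUR+": (6, 15),
}

# Per-class accuracy, rounded to whole percentages.
rates = {k: round(100 * hit / n) for k, (hit, n) in verified.items()}

# Pooled hit rate over all predictions that could be checked.
total_hits = sum(hit for hit, _ in verified.values())
total_checked = sum(n for _, n in verified.values())
pooled = round(100 * total_hits / total_checked)

print(rates, total_hits, total_checked, pooled)
```

Pooling the four rows gives 54 verified out of 74 checked, which rounds to the 73% confirmed hit rate quoted above.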
Graphic Image of ALL predictions (men only), even if erroneous in many places
KEY:
Yellow: 1 possibility only (100% accurate)
Green: 2 possibilities (67% accurate)
Red: 3 possibilities (59%)
Other: 40% accurate, 4-9 possibilities (many of those lacking dates from the tax records); the algorithm can be improved by upgrading predictions where there is one CLEAR LEADER because of marriage dates even if the name of the father is a common one.
This procedure will also improve accuracy.
First Program
fa-son.for
fa-son.exe
flor1.for with documentation lines and
output
pgraph
explanation
flor1.exe
marriage dates assumption: Son later (10 to 40) years (increase to 66?)
marriage dates assumption: Son under 10/over 40: not the father (some errors!)
tax rolls assumption: Son same or up to 60 years later, if no marriage data
first names matchup: strip everything after / . , m.
adjacent lines with same master ID: same ego
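The matching assumptions listed above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not a translation of fa-son.for or flor1.for. The function names are hypothetical, and the reading of "strip everything after / . , m." as "cut the name at the first slash, period, comma, or ' m.'" is an assumption.

```python
import re


def normalize_first_name(name):
    """Cut a first name at the first '/', '.', ',' or ' m.' and lowercase it."""
    return re.split(r"\s+m\.|[/.,]", name, maxsplit=1)[0].strip().lower()


def plausible_father(son_marr=None, father_marr=None,
                     son_tax=None, father_tax=None,
                     min_lag=10, max_lag=40, max_tax_lag=60):
    """Apply the date assumptions to one candidate father-son pair.

    Marriage dates are used when both are known (son 10 to 40 years later);
    otherwise tax years are used (son same year or up to 60 years later).
    All years are full four-digit years.
    """
    if son_marr is not None and father_marr is not None:
        return min_lag <= son_marr - father_marr <= max_lag
    if son_tax is not None and father_tax is not None:
        return 0 <= son_tax - father_tax <= max_tax_lag
    return False


print(normalize_first_name("Niccolo/aio m."))
print(plausible_father(son_marr=1400, father_marr=1370))
# A 48-year lag fails the default window but passes if it is widened to 66.
print(plausible_father(son_marr=1400, father_marr=1352))
print(plausible_father(son_marr=1400, father_marr=1352, max_lag=66))
```

Keeping the lag limits as parameters makes it easy to try the wider 66-year window mentioned above and see how many extra candidates it admits.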
Errors: Changes in the SPSS data
14albberant1 and 14albberant1 should be SEPARATE MASTER IDS IV:4 and IV:4 in GENEALOGIES
The Spss data is fairly clean as regards standardization of spellings BUT SOME MORE NAMES WERE STANDARDIZED.
problems encountered: Niccolo vs. Niccolo/aio m. etc.
188buobadghe Badella -> Badello should be male not female
367fretegner Teglia -> Teglio should be male not female
720sconerber Nera -> Nero should be male not female
14albmargia1 'F' should be female not male
9aglguemas marr 378 -> 1378 (error in date)
404gindomgiu taxyear 48 -> 480 (just a guess)
if you transform/calculate a new Spss variable taxlessm=taxyear+1000-marr and do a univariate frequency count you will
see that the difference runs from -65 to +64 with a mean of 2, and two wild outliers of +75 and +99 as follows:
292corlucgio taxed: 1427 Married: 1328 obviously one of these is wrong
62balaldagn taxed: 1427 Married: 1352 [and of course there are many more where the dates are 50 or more years discrepant]
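The same check can be sketched outside Spss. The Python fragment below computes taxlessm = taxyear + 1000 - marr and flags values outside the -65 to +64 range reported above. It assumes, as the formula implies, that taxyear is stored as three digits (427) and marr as four (1328). The third record is hypothetical and is included only to show a case that is not flagged.

```python
# (id, taxyear as stored, marriage year)
records = [
    ("292corlucgio", 427, 1328),
    ("62balaldagn", 427, 1352),
    ("hypothetical01", 427, 1420),   # made-up record, not from the data
]


def taxlessm(taxyear, marr):
    """Tax year (three-digit form) minus marriage year."""
    return taxyear + 1000 - marr


def flag_outliers(rows, low=-65, high=64):
    """Return (id, difference) for rows outside the usual range."""
    flagged = []
    for rid, taxyear, marr in rows:
        diff = taxlessm(taxyear, marr)
        if not low <= diff <= high:
            flagged.append((rid, diff))
    return flagged


outliers = flag_outliers(records)
print(outliers)
```

For the two documented cases this gives differences of 99 and 75 years, matching the outliers named above.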
18aldmatbon fpart 366 -> 1366
Second Program
flor.for with documentation lines and
output
pgraph
explanation
flor.exe
Graphic Image of BEST predictions (men AND women), colored
yellow=one choice, green=2, red=3 and blue=4.
Other Questions
I am trying to make the most of the dates for fa/son matching hence:
lana has only two entries: 353 (421 cases) and 382 (326 cases). What are these?
cambm has dates from 1330-1524 what are they?
fpart has dates from 1299-1503 what are they?
nc427 1-8, diminishing freqs what are they?
comp427 eg C999000L what are they?
patr480 0-57930 mean of 73 what are they?
For my info
tax years are 351 378 403 427 458 480
ngh years are 351 378 403 427 458 480 - can get date from taxyear
pair480 1-over 334
ngh351 11-44
ngh378 11-44
qt403 1-4 quarters of town?
ngh403 11-44
ngh427 11-44
ngh458 11-44
ngh480 11-44