Robust Action

see Instructions

To use Pajek with the data files read by the flor.exe program (last updated 5/26/99), it will be helpful to read "Analyzing Large Kinship and Marriage Networks" by drw, Vladimir Batagelj and Andrej Mrvar, 1999 (forthcoming), Social Science Computer Review. You will need a password for it; ask for it again if you do not remember it.

Final Version

The final version of the algorithm is completed and a preliminary analysis done.
The Alberti graph is partitioned out of the whole ball of wax (50,000 nodes). Just looking at the top 4 generations of the lowermost big family segment, the program found 29 of 30 links in the genealogy (97% accuracy) and the database was missing only 4 significant links in the genealogy. One case was not matched because of a spelling variant that has now been corrected (Attaviano=>Ottaviano). In only one case (Antonio Doffo Alberti) would we have made a match for a name missing a marriage date where the tax year would have made the match correctly. I hesitate to do this for now because it might introduce more errors.
Pajek will now read the ENTIRE GRAPH of 57,505 nodes and partitions for families are made automatically by adding their names to the fam-list file. Single families or their unions can be examined. Graphs such as Alberti by decade are now a matter of course for any of the 1,000 or more families. The vertical axis is the actual date of marriage by decade, starting in 1200. The color coding gives the "accuracy" measure, the number of potential fathers found (green=1, red=2, blue=3).
The total graph has 86,187 lines, 28,680 arcs, 30,959 components, and 2,134 independent marriage cycles, of which 2,127 (99.6%) are in the giant component (#59 in the partition) of 19,030 vertices and 21,426 arcs. Of the total of 2,134 independent marriage cycles, 2,113 (99.0%) are in the giant bicomponent of 5,653 vertices and 7,765 arcs. This is a very hefty bicomponent. The remaining cycles are generally created by blood marriages outside the bicomponent. Of the vertices in the bicomponent:
  • 3,568 have degree 2
  • 1,027 have degree 3
  • 538 have degree 4
  • 250 have degree 5
  • 125 have degree 6
  • 68 have degree 7
  • 40 have degree 8
  • 16 have degree 9
  • 14 have degree 10
  • 4 have degree 11
  • 2 have degree 12
  • 1 has degree 13
  • 4 are at generational depth 1 (latest)
  • 60 are at depth 2
  • 207 at depth 3
  • 512 at depth 4
  • 766 at depth 5
  • 852 at depth 6
  • 797 at depth 7
  • 707 at depth 8
  • 572 at depth 9
  • 383 at depth 10
  • 306 at depth 11
  • 204 at depth 12
  • 127 at depth 13
  • 91 at depth 14
  • 44 at depth 15
  • 16 at depth 16
  • 4 at depth 17 (earliest)
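
The counts above can be cross-checked with the standard cyclomatic number of a graph, arcs minus vertices plus components. The Python sketch below is only a sanity check on the figures quoted in this note; the function and variable names are illustrative and are not part of flor.exe or Pajek.

```python
# Sanity check of the figures quoted above (a sketch, not part of flor.exe or Pajek).

def independent_cycles(arcs, vertices, components=1):
    """Cyclomatic number of a graph: arcs - vertices + components."""
    return arcs - vertices + components

# Degree distribution of the giant bicomponent, copied from the list above.
degree_counts = {2: 3568, 3: 1027, 4: 538, 5: 250, 6: 125, 7: 68,
                 8: 40, 9: 16, 10: 14, 11: 4, 12: 2, 13: 1}

bicomp_vertices = sum(degree_counts.values())
# Each arc adds one to the degree of two vertices, so the degree sum is twice the arc count.
bicomp_arcs = sum(degree * count for degree, count in degree_counts.items()) // 2

total_cycles = independent_cycles(28680, 57505, 30959)
bicomp_cycles = independent_cycles(bicomp_arcs, bicomp_vertices)

print(bicomp_vertices, bicomp_arcs, total_cycles, bicomp_cycles)
# prints: 5653 7765 2134 2113
```

The degree list reproduces the 5,653 vertices and 7,765 arcs of the bicomponent, and the formula returns the 2,134 and 2,113 cycles reported above.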

  • Instructions

    I sent the *.dat and the fam-list files (you can edit the latter) by email. Run the program using options 1,1,1. The program will construct the master file and a series of *.clu and *.cls files for each family that you add to the list in fam-list. Right now there are alberti and medici followed by stop, and the rest of the names are ignored. Option 1,1,1 will allow you to construct separate graphs for each family and option 1,1,2 will concatenate all the families in your list (you use the fam1nam.clu, fam2acc.clu, fam3dec.clu in this case for the instructions below). You don't need to use option 1,2; it was just for program development with a reduced database.
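
As a rough illustration of how the fam-list is interpreted, here is a Python sketch. It assumes one family name per line (the actual layout of fam-list is not spelled out here), and the helper names are hypothetical; only the "stop" sentinel and the albe1nam.clu / albe2acc.clu / albe3dec.clu naming pattern come from this note.

```python
# Sketch of how a fam-list file might be interpreted (one name per line is an assumption).

def active_families(lines):
    """Return the family names listed before the 'stop' sentinel; later names are ignored."""
    names = []
    for line in lines:
        name = line.strip().lower()
        if not name:
            continue
        if name == "stop":
            break
        names.append(name)
    return names

def partition_files(family):
    """Partition file names for one family, following the albe1nam.clu pattern."""
    prefix = family[:4]  # first four letters of the family name
    return [prefix + suffix + ".clu" for suffix in ("1nam", "2acc", "3dec")]

# 'strozzi' is a hypothetical trailing entry, placed after 'stop' to show it is ignored.
fam_list = ["alberti", "medici", "stop", "strozzi"]
for family in active_families(fam_list):
    print(family, partition_files(family))
```

Moving a name above the stop line is all it takes to have its partitions generated on the next run.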

    Run Pajek and select

  • Then read the family partition such as albe1nam.clu. Click Partitions/First Partition then Partitions/Second Partition and then Partitions/Extract Second from First. Save this NAMES partition, e.g., as Albe1nam411.clu where 411 is the number of nodes in the NAMES reduction (including the 7 timeline nodes if you add dates).

  • Then read albe2acc.clu. Click Partitions/First Partition then Partitions/Extract Second from First. Save this Partition, e.g., as Albe2acc411.clu where 411 is the number of nodes in the ACCURACY reduction.

  • Then read albe3dec.clu. Click Partitions/First Partition then Partitions/Extract Second from First. Save this Partition, e.g., as Albe3dec411.clu where 411 is the number of nodes in the DECADES reduction.

  • Now again read (1) the total network and (2) the family partition albe1nam.clu. To extract the alberti subnetwork, click Operations/Extract from Network/Partition [from 1 to 1 is automatic: the partition is 0/1 where the 1s are the Alberti nodes].

  • To construct the graph with decades, read the DECADES partition albe3dec411.clu that you saved. Now click Draw/Draw Partition. In the Draw screen click Layers/in y direction followed by Layers/Optimize layers in x direction/Forward. Your graph will appear. If you did this correctly and you chose the 1 option at the start that adds timelines, then the seven nodes to the right of the image will be the timeline nodes with appropriate edge labels for the periods. To see the timeline and first-name labels click Options/Lines/Mark lines. To see the last name labels (which clutter the page) click Options/Mark vertices using/Labels. To see the difference between male and female lines, then click Options/Mark vertices using/No Labels.

  • Thus you can do a graph for ANY family or ANY COMBINED SET of families (like the Alberti-Medici graph) by adding to the fam-list and using options 1,1,1 or 1,1,2. However, a better way of doing this is just to run the FLOR.EXE program once, and later run FLOR-ADD.EXE giving the first four letters of each family name you want to concatenate. If you pick albe and medi for example, then the albemedi2acc.clu and albemedi3dec.clu files will be created for you as partitions, and you can extract in this way the union of any set of families (the output file names will always be those of the first two families).

  • For checking the data family by family, print relevant sections of the "scratch" file created by the program, in which each family is taken up alphabetically and within each family the sibling sets are taken up alphabetically by the name of the father. The format will become evident if you stare at it for a while. The last pair of numbers on the right are masterID numbers for a couple corresponding to the individual, and the parents of that individual. These numbers are repeated in the "Spss" file, which can be inserted directly as variables in the master file, but some of the numbers will refer to constructed parents. There is no funny stuff anymore with constructed parents. All assumptions are STRICTLY CONSERVATIVE. If there is anything funny in the scratch file let me know immediately.

  • To see the size of the bicomponent (which can generate a 0/1 partition that can go into the MASTER file for cross tabulation with other variables) the quickest way is to read the total network and click Partition/Components/Bi-Components ... and choose a large minimum size such as 100 or 1000. Then click Hierarchy/Extract Cluster (you can always choose cluster 1 when there is only one giant bicomponent). You will be amazed at how quickly the bicomponent is computed (in a few seconds). Then click Operations/Extract from Network/Cluster and the total network will reduce to 5726. This is too big to draw all at once, I believe, but it can become an important analytic variable, and its relinking index is an important fact. You can also reduce the total network for a specific range of decades, and then recalculate the bi-component for that specific historical era. This opens the way to my kind of analysis. Have fun!
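
For anyone who wants to check a reduced partition outside Pajek, the partition steps above can be sketched in Python. This assumes the usual .clu layout (a "*Vertices n" header followed by one cluster number per vertex) and a 0/1 family partition; the function names are made up for illustration, and Pajek's own Extract Second from First may differ in details.

```python
# Sketch: reduce one partition by another, in the spirit of
# Partitions/Extract Second from First (Pajek's own implementation may differ).

def parse_clu(text):
    """Parse a .clu partition: a '*Vertices n' header, then one cluster number per vertex."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[1])
    values = [int(v) for v in lines[1:1 + n]]
    if len(values) != n:
        raise ValueError("partition is shorter than its header claims")
    return values

def extract(values, selector, keep=1):
    """Keep the entries of `values` whose vertex has cluster `keep` in `selector`."""
    return [v for v, s in zip(values, selector) if s == keep]

def to_clu(values):
    """Serialize a partition back to .clu text."""
    return "\n".join(["*Vertices %d" % len(values)] + [str(v) for v in values]) + "\n"

# Toy data: five vertices, three of them in the family (selector value 1).
family = parse_clu("*Vertices 5\n1\n0\n1\n1\n0\n")
decades = parse_clu("*Vertices 5\n1340\n1350\n1360\n1340\n1370\n")
reduced = extract(decades, family)
print(to_clu(reduced))
```

The length of the reduced list is the node count that goes into a file name such as Albe3dec411.clu.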

    Name Matching Notes: Algorithm tested for Alberti

    14albberjac - perfect match of father except for wife's name (computer: sandra; genealogy: Lisa)
    75% predicted with 100% accuracy for Alberti genealogy I-VII (missing VIII)
    15% probably right 67% later than genealogy
    10% probable 60% from alternatives given dates of marriages
    total accuracy probably 91% but rated by certain / likely / best guess
    some misses due to longer fa-son lag in marriage dates: 48,60,61,64,66 years, often due to second marriages

    Second Test of First Algorithm

    Includes a numeric quality rank for predicting potential matches
    TRUE ANCESTORS occur at the following rank frequencies:
    57 were rank 1 (predicted perfectly)
    1 tied for quality rank 1
    5 tied for inferior rank 1 (both some possibility)
    7 rank 2
    2 rank 3
    1 rank 4
    1 rank 5
    2 tied among six with inferior rank
    2 not found at all, although six with inferior rank
    55 have only one possible match
    6 have 2 possible matches
    1 has 5 possible matches
    1 has 6 possible matches
    1 has 8 possible matches
    ONE: 100% in 75 predictions, 30 verified
    TWO: 67% in 17 predictions, 8/12 verified
    THREE: 59% in 17 predictions, 10/17 verified
    FOUR+: 40% in 23 predictions, 6/15 verified
    FOR PAJEK FILES, stripping away the arcs from nodes with more than 1, 2, 3 etc. parents of a given type will successively improve the accuracy of the genealogy
    OVERALL ACCURACY IF YOU TAKE ALL PREDICTIONS is 90% with no contradictions in genealogy
    among those found in the genealogies, the confirmed hit rate is 73%
    almost 100% of the ancestors in the genealogies who do have offspring are found by the algorithm as possible alternatives, even if not the top prediction
    Graphic Image of ALL predictions (men only), even if erroneous in many places
    Yellow: 1 possibility only (100% accurate)
    Green: 2 possibilities (67% accurate)
    Red: 3 possibilities (59%)
    Other: 40% accurate, 4-9 possibilities (many of those lacking dates from the tax records); the algorithm can be improved by upgrading predictions where there is one CLEAR LEADER because of marriage dates even if the name of the father is a common one. This procedure will also improve accuracy.
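
The percentages in the color key follow directly from the verified counts listed above (8/12, 10/17, 6/15). A small Python sketch of that arithmetic and of the color assignment is given here; the names are illustrative only and nothing below is taken from flor.for.

```python
# Recompute the verified hit rates from the counts above (a sketch for checking arithmetic).

# (confirmed, checked) for predictions with 2, 3, and 4 or more candidate fathers.
verified = {"TWO": (8, 12), "THREE": (10, 17), "FOUR+": (6, 15)}

def hit_rate(confirmed, checked):
    """Percentage of checked predictions confirmed by the genealogy, rounded to an integer."""
    return round(100 * confirmed / checked)

def color_for(candidates):
    """Color used in the graphic for a man with this many candidate fathers."""
    return {1: "yellow", 2: "green", 3: "red"}.get(candidates, "other")

rates = {tier: hit_rate(c, n) for tier, (c, n) in verified.items()}
print(rates)
# prints: {'TWO': 67, 'THREE': 59, 'FOUR+': 40}
print([color_for(k) for k in (1, 2, 3, 6)])
# prints: ['yellow', 'green', 'red', 'other']
```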

    First Program

    flor1.for with documentation lines and output pgraph explanation
    marriage dates assumption: Son later (10 to 40) years (increase to 66?)
    marriage dates assumption: Son under 10/over 40: not the father (some errors!)
    tax rolls assumption: Son same or up to 60 years later, if no marriage data
    first names matchup: strip everything after / . , m.
    adjacent lines with same master ID: same ego
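
The matching assumptions listed above can be written down compactly. The Python sketch below is not the flor1.for code (which is Fortran and not reproduced here); the function names, the argument layout, and the choice to fall back to tax years only when a marriage date is missing are my reading of the rules, so treat them as assumptions.

```python
import re
from typing import Optional

def first_name_key(name: str) -> str:
    """Strip everything after '/', '.', ',' or ' m.' and lower-case what is left."""
    return re.split(r"\s+m\.|[/.,]", name, maxsplit=1)[0].strip().lower()

def plausible_father(father_marr: Optional[int], son_marr: Optional[int],
                     father_tax: Optional[int] = None,
                     son_tax: Optional[int] = None) -> bool:
    """Apply the date windows: 10 to 40 years between marriages, else 0 to 60 between tax years."""
    if father_marr is not None and son_marr is not None:
        return 10 <= son_marr - father_marr <= 40
    if father_tax is not None and son_tax is not None:
        return 0 <= son_tax - father_tax <= 60
    return False  # assumption: no usable dates means no match

print(first_name_key("Niccolo/aio m."))          # niccolo
print(plausible_father(1350, 1375))              # True: 25-year lag
print(plausible_father(1350, 1398))              # False: 48-year lag, outside the window
print(plausible_father(None, None, 1351, 1403))  # True: tax years 52 apart
```

Widening the upper marriage bound from 40 toward 66 would admit the long father-son lags noted in the Alberti test, at the cost of more candidate fathers per son.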

    Errors: Changes in the SPSS data

    14albberant1 and 14albberant1 should be SEPARATE MASTER IDS IV:4 and IV:4 in GENEALOGIES
    The Spss data is fairly clean as regards standardization of spellings BUT SOME MORE NAMES WERE STANDARDIZED.
    problems encountered: Niccolo vs. Niccolo/aio m. etc.
    188buobadghe Badella -> Badello should be male not female
    367fretegner Teglia -> Teglio should be male not female
    720sconerber Nera -> Nero should be male not female
    14albmargia1 'F' should be female not male
    9aglguemas marr 378 -> 1378 (error in date)
    404gindomgiu taxyear 48 -> 480 (just a guess)
    if you transform/calculate a new Spss variable taxlessm=taxyear+1000-marr and do a univariate frequency count you will see that the difference runs from -65 to +64 with a mean of 2, and two wild outliers of +75 and +99 as follows:
    292corlucgio taxed: 1427 Married: 1328 obviously one of these is wrong
    62balaldagn taxed: 1427 Married: 1352 [and of course there are many more where the dates are 50 or more years discrepant]
    18aldmatbon fpart 366 -> 1366
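
The taxlessm check described above is easy to reproduce outside Spss. This Python sketch assumes the tax year is stored without its leading 1 (427 for 1427), which is what the +1000 in the formula implies, and that marr holds the full year; the record layout is made up for illustration.

```python
def taxlessm(taxyear, marr):
    """Tax year minus marriage year, with taxyear stored without its leading 1 (427 = 1427)."""
    return taxyear + 1000 - marr

# The two outliers named above, as (taxyear, marr).
records = {"292corlucgio": (427, 1328), "62balaldagn": (427, 1352)}

diffs = {rid: taxlessm(tax, marr) for rid, (tax, marr) in records.items()}
# Flag anything outside the -65 to +64 range seen for the rest of the data.
outliers = [rid for rid, diff in diffs.items() if not -65 <= diff <= 64]

print(diffs)
# prints: {'292corlucgio': 99, '62balaldagn': 75}
print(outliers)
# prints: ['292corlucgio', '62balaldagn']
```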

    Second Program

    flor.for with documentation lines and output pgraph explanation
    Graphic Image of BEST predictions (men AND women), colored yellow=one choice, green=2, red=3 and blue=4.

    Other Questions

    I am trying to make the most of the dates for fa/son matching hence:
    lana has only two entries: 353 (421 cases) and 382 (326 cases). What are these?
    cambm has dates from 1330-1524. What are they?
    fpart has dates from 1299-1503. What are they?
    nc427 1-8, diminishing freqs. What are they?
    comp427 e.g. C999000L. What are they?
    patr480 0-57930, mean of 73. What are they?

    For my info

    tax years are 351 378 403 427 458 480
    ngh years are 351 378 403 427 458 480 - can get date from taxyear
    pair480 1-over 334
    ngh351 11-44
    ngh378 11-44
    qt403 1-4 quarters of town?
    ngh403 11-44
    ngh427 11-44
    ngh458 11-44
    ngh480 11-44