Splitting and Merging GEDCOM files

Introduction

Splitting a GEDCOM file in two and merging two GEDCOM files into one are among the most difficult tasks to carry out. You might want to extract a specific branch of your tree to give to someone else, or combine two files describing your maternal and paternal ancestors. The difficulty comes from the cross-references that occur within a GEDCOM file and, when merging, from determining whether two records should be merged into one.

We illustrate with the following sample file:

library(tidyged)
library(tidyged.utils)

summary(sample555)
#> GEDCOM file summary: 
#>  
#>  Submitter:               Reldon Poulson 
#>  Description:              
#>  Language:                English 
#>  Character set:           UTF-8 
#>  
#>  Copyright:                
#>  
#>  Source system:           GS 
#>  Source system version:   5.5.5 
#>  Product name:            GEDCOM Specification 
#>  Product source:          gedcom.org
describe_records(sample555, sample555$record, short_desc = TRUE)
#> [1] "Submitter @U1@, Reldon Poulson"                                       
#> [2] "Individual @I1@, Robert Eugene Williams"                              
#> [3] "Individual @I2@, Mary Ann Wilson"                                     
#> [4] "Individual @I3@, Joe Williams"                                        
#> [5] "Family @F1@, headed by Robert Eugene Williams and Mary Ann Wilson"    
#> [6] "Family @F2@, headed by Robert Eugene Williams"                        
#> [7] "Source @S1@, titled Madison County Birth, Death, and Marriage Records"
#> [8] "Repository @R1@, Family History Library"

Splitting files

Splitting a file is much easier than merging two files. To split a file, we pass split_gedcom() the xrefs of the records we want the new file to contain. In this example, we take the family @F2@ and the two individuals within it:

new <- split_gedcom(sample555, c("@F2@", "@I1@", "@I3@"))
#> Some dead record references have been removed: @S1@, @F1@

summary(new)
#> GEDCOM file summary: 
#>  
#>  Submitter:               Reldon Poulson 
#>  Description:              
#>  Language:                English 
#>  Character set:           UTF-8 
#>  
#>  Copyright:                
#>  
#>  Source system:           GS 
#>  Source system version:   5.5.5 
#>  Product name:            GEDCOM Specification 
#>  Product source:          gedcom.org

The new file has exactly the same header and submitter information as the original. Let’s see which records it contains:

describe_records(new, new$record, short_desc = TRUE)
#> [1] "Submitter @U1@, Reldon Poulson"               
#> [2] "Individual @I1@, Robert Eugene Williams"      
#> [3] "Individual @I3@, Joe Williams"                
#> [4] "Family @F2@, headed by Robert Eugene Williams"

By default, split_gedcom() removes references to records that do not exist in the new file. It reports which records these are, in case you want to go back and include them.
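To make the idea of a dead reference concrete, here is a minimal base R sketch on a made-up table with the same level, record, tag and value columns as a tidyged object. It is not how split_gedcom() is implemented; the rows, the pointer pattern and the variable names are assumptions chosen purely for illustration.

```r
# Toy GEDCOM-like table (rows are made up for illustration)
ged <- data.frame(
  level  = c(0, 1, 1, 0, 1),
  record = c("@I1@", "@I1@", "@I1@", "@F2@", "@F2@"),
  tag    = c("INDI", "FAMS", "FAMS", "FAM", "HUSB"),
  value  = c("", "@F1@", "@F2@", "", "@I1@"),
  stringsAsFactors = FALSE
)

# Assumption: a value is a pointer if it looks like @XREF@
is_pointer <- grepl("^@[A-Za-z0-9]+@$", ged$value)

# A pointer is dead if its target is not a record in the table
dead <- is_pointer & !(ged$value %in% ged$record)
dead_xrefs <- unique(ged$value[dead])

# Drop the rows holding dead pointers
cleaned <- ged[!dead, ]
```

In this toy table, dead_xrefs contains only @F1@, because no record with that xref exists. The sketch drops just the pointer line itself; a fuller version would also need to drop any lines subordinate to a removed pointer.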

Merging files

Merging two files is much more involved. Cross-reference identifiers must be made unique across both files, and potential duplicate records must be identified and then merged. The merge_gedcoms() function does all of this automatically.

Unfortunately, it cannot be demonstrated here because it asks for user input whenever potentially duplicate records are identified.
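One part of the task, making cross-reference identifiers unique across both files, needs no user input and can be sketched in base R. The helper make_xrefs_unique() below is hypothetical and is not part of tidyged.utils; the data and the renumbering scheme (next free number for the same prefix) are assumptions for illustration only.

```r
# xrefs already used by the first file, and a toy second file (made up)
xrefs_a <- c("@I1@", "@I2@", "@F1@")
ged_b <- data.frame(
  record = c("@I1@", "@I1@", "@F1@", "@F1@"),
  tag    = c("INDI", "FAMS", "FAM", "HUSB"),
  value  = c("", "@F1@", "", "@I1@"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: give every clashing xref in `ged` a new number that is
# unused in either file, and update every pointer to it
make_xrefs_unique <- function(ged, taken) {
  for (old in intersect(unique(ged$record), taken)) {
    prefix <- sub("^@([A-Za-z]+)[0-9]+@$", "\\1", old)
    n <- 1
    while (paste0("@", prefix, n, "@") %in% c(taken, ged$record)) n <- n + 1
    new <- paste0("@", prefix, n, "@")
    ged$record[ged$record == old] <- new  # rename the record itself
    ged$value[ged$value == old] <- new    # rename pointers to it
    taken <- c(taken, new)
  }
  ged
}

renumbered <- make_xrefs_unique(ged_b, xrefs_a)
```

After this step the second file's @I1@ becomes @I3@ and its @F1@ becomes @F2@, so the two sets of records could be stacked without identifiers clashing. Deciding which records are duplicates is a separate problem.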

The process of merging files contains many steps, and some of these steps are useful in their own right and are exposed to the user. These are:

Identifying whether records in a file are potential duplicates (seeks user input)
Merging selected records into a single record
Removing duplicate subrecords

Merging records

Multiple records can be merged into one using the merge_records() function. To illustrate, we take the sample file and add a duplicate record for one of the individuals:

with_dupes <- sample555 |> 
  add_indi(sex = "M") |> 
  add_indi_names(name_pieces(given = "Joe", surname = "Williams"))
#> Added Male Individual: @I4@

describe_records(with_dupes, with_dupes$record, short_desc = TRUE)
#> [1] "Submitter @U1@, Reldon Poulson"                                       
#> [2] "Individual @I1@, Robert Eugene Williams"                              
#> [3] "Individual @I2@, Mary Ann Wilson"                                     
#> [4] "Individual @I3@, Joe Williams"                                        
#> [5] "Family @F1@, headed by Robert Eugene Williams and Mary Ann Wilson"    
#> [6] "Family @F2@, headed by Robert Eugene Williams"                        
#> [7] "Source @S1@, titled Madison County Birth, Death, and Marriage Records"
#> [8] "Repository @R1@, Family History Library"                              
#> [9] "Individual @I4@, Joe Williams"

We now merge the two records:

merged <- merge_records(with_dupes, c("@I3@","@I4@"))

describe_records(merged, merged$record, short_desc = TRUE)
#> [1] "Submitter @U1@, Reldon Poulson"                                       
#> [2] "Individual @I1@, Robert Eugene Williams"                              
#> [3] "Individual @I2@, Mary Ann Wilson"                                     
#> [4] "Family @F1@, headed by Robert Eugene Williams and Mary Ann Wilson"    
#> [5] "Family @F2@, headed by Robert Eugene Williams"                        
#> [6] "Source @S1@, titled Madison County Birth, Death, and Marriage Records"
#> [7] "Repository @R1@, Family History Library"                              
#> [8] "Individual @I3@, Joe Williams"

We can take a closer look at this merged record to see what has happened:

dplyr::filter(merged, record == "@I3@") |> 
  knitr::kable()

level	record	tag	value
0	@I3@	INDI
1	@I3@	NAME	Joe /Williams/
2	@I3@	SURN	Williams
2	@I3@	GIVN	Joe
1	@I3@	SEX	M
1	@I3@	BIRT
2	@I3@	DATE	11 JUN 1861
2	@I3@	PLAC	Idaho Falls, Bonneville, Idaho, United States of America
1	@I3@	FAMC	@F1@
1	@I3@	FAMC	@F2@
2	@I3@	PEDI	adopted
1	@I3@	ADOP
2	@I3@	DATE	16 MAR 1864
1	@I3@	SEX	M
1	@I3@	NAME	Joe /Williams/
2	@I3@	GIVN	Joe
2	@I3@	SURN	Williams
1	@I3@	CHAN
2	@I3@	DATE	22 NOV 2024

Removing duplicate subrecords

The merged record above now has a duplicate sex subrecord and a duplicate name subrecord. We can remove these with the remove_duplicate_subrecords() function:

remove_duplicate_subrecords(merged, "@I3@") |> 
  dplyr::filter(record == "@I3@") |>
  knitr::kable()

level	record	tag	value
0	@I3@	INDI
1	@I3@	NAME	Joe /Williams/
2	@I3@	SURN	Williams
2	@I3@	GIVN	Joe
1	@I3@	SEX	M
1	@I3@	BIRT
2	@I3@	DATE	11 JUN 1861
2	@I3@	PLAC	Idaho Falls, Bonneville, Idaho, United States of America
1	@I3@	FAMC	@F1@
1	@I3@	FAMC	@F2@
2	@I3@	PEDI	adopted
1	@I3@	ADOP
2	@I3@	DATE	16 MAR 1864
1	@I3@	CHAN
2	@I3@	DATE	22 NOV 2024

Both duplicate subrecords have been removed.
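As a rough illustration of what detecting duplicate subrecords can involve, the base R sketch below groups the lines of a record into level 1 blocks and keeps only the first copy of each. It is not the implementation of remove_duplicate_subrecords(); in particular, treating two blocks as equal when they contain the same lines in any order is an assumption, made here because the two name subrecords above list GIVN and SURN in different orders.

```r
# Part of the merged @I3@ record shown earlier, without the record column
rec <- data.frame(
  level = c(0, 1, 2, 2, 1, 1, 1, 2, 2),
  tag   = c("INDI", "NAME", "SURN", "GIVN", "SEX", "SEX", "NAME", "GIVN", "SURN"),
  value = c("", "Joe /Williams/", "Williams", "Joe", "M", "M",
            "Joe /Williams/", "Joe", "Williams"),
  stringsAsFactors = FALSE
)

# Each level 0 or level 1 line starts a new block; deeper lines belong to it
block_id <- cumsum(rec$level <= 1)

# Describe each block by its sorted lines, so child order does not matter
line_text <- paste(rec$level, rec$tag, rec$value)
block_key <- vapply(split(line_text, block_id),
                    function(x) paste(sort(x), collapse = "\n"),
                    character(1))

# Keep only the first block for each distinct key
keep_blocks <- as.integer(names(block_key)[!duplicated(block_key)])
deduped <- rec[block_id %in% keep_blocks, ]
```

Here deduped keeps the first NAME block and the first SEX line and drops the repeated ones.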