Reducing file bloat

Change dates

All top-level records in a GEDCOM file can record the date and time they were last modified. The tidyged package (the main package for creating and summarising GEDCOM files) adds a change date (today’s date) by default every time a record is created or modified. Because the time of day is very unlikely to be useful in this context, the package leaves it out by default. To illustrate, we load the tidyged and tidyged.utils packages and create an example object.

library(tidyged)
library(tidyged.utils)

gedcom(subm("Me")) |> 
  knitr::kable()
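| level | record | tag | value |
|-------|--------|-----|-------|
| 0 | HD | HEAD | |
| 1 | HD | GEDC | |
| 2 | HD | VERS | 5.5.5 |
| 2 | HD | FORM | LINEAGE-LINKED |
| 3 | HD | VERS | 5.5.5 |
| 1 | HD | CHAR | UTF-8 |
| 1 | HD | DEST | gedcompendium |
| 1 | HD | SOUR | gedcompendium |
| 2 | HD | NAME | The ‘gedcompendium’ ecosystem of packages for the R language |
| 2 | HD | CORP | Jamie Lendrum |
| 3 | HD | ADDR | |
| 3 | HD | EMAIL | |
| 3 | HD | WWW | https://jl5000.github.io/tidyged/ |
| 1 | HD | DATE | 22 NOV 2024 |
| 1 | HD | LANG | English |
| 1 | HD | SUBM | @U1@ |
| 0 | @U1@ | SUBM | |
| 1 | @U1@ | NAME | Me |
| 1 | @U1@ | CHAN | |
| 2 | @U1@ | DATE | 22 NOV 2024 |
| 0 | TR | TRLR | |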

Rows 19 and 20 (the CHAN line and the DATE line beneath it) hold the change date for the submitter record.

Each change date adds two rows to a record, so for GEDCOM files with thousands of records they can add considerable bloat. For this reason, all change date structures can be removed with the remove_change_dates() function:

gedcom(subm("Me")) |> 
  remove_change_dates() |> 
  knitr::kable()
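| level | record | tag | value |
|-------|--------|-----|-------|
| 0 | HD | HEAD | |
| 1 | HD | GEDC | |
| 2 | HD | VERS | 5.5.5 |
| 2 | HD | FORM | LINEAGE-LINKED |
| 3 | HD | VERS | 5.5.5 |
| 1 | HD | CHAR | UTF-8 |
| 1 | HD | DEST | gedcompendium |
| 1 | HD | SOUR | gedcompendium |
| 2 | HD | NAME | The ‘gedcompendium’ ecosystem of packages for the R language |
| 2 | HD | CORP | Jamie Lendrum |
| 3 | HD | ADDR | |
| 3 | HD | EMAIL | |
| 3 | HD | WWW | https://jl5000.github.io/tidyged/ |
| 1 | HD | DATE | 22 NOV 2024 |
| 1 | HD | LANG | English |
| 1 | HD | SUBM | @U1@ |
| 0 | @U1@ | SUBM | |
| 1 | @U1@ | NAME | Me |
| 0 | TR | TRLR | |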
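To make the removed structure concrete, here is a minimal base R sketch of the idea on a toy table with the same four columns. It is not how remove_change_dates() is implemented; the helper name drop_chan_rows() and the toy data are assumptions made purely for illustration. It drops each CHAN row together with every row nested beneath it:

```r
# Toy table with the same four columns as a tidyged object (illustrative data)
ged <- data.frame(
  level  = c(0, 1, 1, 2, 0),
  record = c("@U1@", "@U1@", "@U1@", "@U1@", "TR"),
  tag    = c("SUBM", "NAME", "CHAN", "DATE", "TRLR"),
  value  = c("", "Me", "", "22 NOV 2024", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove CHAN rows and their subordinate rows
drop_chan_rows <- function(ged) {
  keep <- rep(TRUE, nrow(ged))
  in_chan <- FALSE
  chan_level <- NA
  for (i in seq_len(nrow(ged))) {
    # Rows deeper than the CHAN row belong to the change date structure
    if (in_chan && ged$level[i] > chan_level) {
      keep[i] <- FALSE
      next
    }
    in_chan <- FALSE
    if (ged$tag[i] == "CHAN") {
      in_chan <- TRUE
      chan_level <- ged$level[i]
      keep[i] <- FALSE
    }
  }
  ged[keep, , drop = FALSE]
}

slim <- drop_chan_rows(ged)

# Number of rows saved by dropping the change date structure
rows_saved <- nrow(ged) - nrow(slim)
```

In this toy table the change date accounts for two of the five rows; in a real file the saving grows with the number of records that carry a change date.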

Unreferenced records

Records that are not referenced by any other record can be found with the identify_unused_records() function. In the example below we create six family group records, half with members and half without, plus an unreferenced Repository record:

some_unref <- gedcom(subm("Me")) |> 
  add_indi(qn = "Tom Smith") |> 
  add_indi(qn = "Tammy Smith") |> 
  add_indi(qn = "Alice White") |> 
  add_indi(qn = "Phil Brown")
#> Added Unknown Individual: @I1@
#> Added Unknown Individual: @I2@
#> Added Unknown Individual: @I3@
#> Added Unknown Individual: @I4@

tom_xref <- find_indi_name(some_unref, "Tom")
tammy_xref <- find_indi_name(some_unref, "Tammy")
phil_xref <- find_indi_name(some_unref, "Phil")
alice_xref <- find_indi_name(some_unref, "Alice")

some_unref <- some_unref |>
  add_famg(husband = tom_xref, wife = tammy_xref) |> 
  add_famg() |> 
  add_famg(husband = phil_xref) |> 
  add_famg() |> 
  add_famg(children = alice_xref) |> 
  add_famg() |> 
  add_repo("Test repo") 
#> Added Family Group: @F1@
#> Added Family Group: @F2@
#> Added Family Group: @F3@
#> Added Family Group: @F4@
#> Added Family Group: @F5@
#> Added Family Group: @F6@
#> Added Repository: @R1@
  
identify_unused_records(some_unref)
#> [1] "@F2@" "@F4@" "@F6@" "@R1@"

We can find out more about these xrefs by using the describe_records() function from the tidyged package:

identify_unused_records(some_unref) |> 
  describe_records(gedcom = some_unref)
#> [1] "Family @F2@, headed by no individuals, and no children"
#> [2] "Family @F4@, headed by no individuals, and no children"
#> [3] "Family @F6@, headed by no individuals, and no children"
#> [4] "Repository @R1@, Test repo"
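The notion of an unreferenced record can be sketched in base R as follows. This is only an illustration of the concept on a toy table, under the assumption that a record counts as referenced when its xref appears as the value of some other row; it is not necessarily how identify_unused_records() works internally, and the toy data are made up:

```r
# Toy table: one individual linked to one family, plus an empty family
# and a repository that nothing points to (illustrative data)
ged <- data.frame(
  level  = c(0, 1, 0, 1, 0, 0),
  record = c("@I1@", "@I1@", "@F1@", "@F1@", "@F2@", "@R1@"),
  tag    = c("INDI", "FAMS", "FAM", "HUSB", "FAM", "REPO"),
  value  = c("", "@F1@", "", "@I1@", "", ""),
  stringsAsFactors = FALSE
)

# Every record is introduced by a level 0 row
all_xrefs <- unique(ged$record[ged$level == 0])

# Values that consist of a single xref are pointers to other records
referenced <- unique(ged$value[grepl("^@[^@]+@$", ged$value)])

# Records that no other row points to
unused <- setdiff(all_xrefs, referenced)
unused
#> [1] "@F2@" "@R1@"
```

As in the example above, the empty family and the repository are flagged, while the individual and the family that point to each other are not.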