Reducing file bloat

Change dates

All top-level records in a GEDCOM file can record the date and time they were last modified. The tidyged package (the main package for creating and summarising GEDCOM files) adds a change date (today’s date) by default every time a record is created or modified. Because the time of day is very unlikely to be useful in this context, the package omits it by default and records only the date. We illustrate by loading the tidyged and tidyged.utils packages and creating an example object.

library(tidyged)
library(tidyged.utils)

gedcom(subm("Me")) |> 
  knitr::kable()

level	record	tag	value
0	HD	HEAD
1	HD	GEDC
2	HD	VERS	5.5.5
2	HD	FORM	LINEAGE-LINKED
3	HD	VERS	5.5.5
1	HD	CHAR	UTF-8
1	HD	DEST	gedcompendium
1	HD	SOUR	gedcompendium
2	HD	NAME	The ‘gedcompendium’ ecosystem of packages for the R language
2	HD	CORP	Jamie Lendrum
3	HD	ADDR
3	HD	EMAIL	[email protected]
3	HD	WWW	https://jl5000.github.io/tidyged/
1	HD	DATE	22 NOV 2024
1	HD	LANG	English
1	HD	SUBM	@U1@
0	@U1@	SUBM
1	@U1@	NAME	Me
1	@U1@	CHAN
2	@U1@	DATE	22 NOV 2024
0	TR	TRLR

Rows 19 and 20 (the CHAN line and the DATE line beneath it) hold the change date for the submitter record.

For GEDCOM files with thousands of records, change dates can add considerable bloat: in the example above, the change date takes up two rows for a single record. You can remove all change date structures with the remove_change_dates() function:

gedcom(subm("Me")) |> 
  remove_change_dates() |> 
  knitr::kable()

level	record	tag	value
0	HD	HEAD
1	HD	GEDC
2	HD	VERS	5.5.5
2	HD	FORM	LINEAGE-LINKED
3	HD	VERS	5.5.5
1	HD	CHAR	UTF-8
1	HD	DEST	gedcompendium
1	HD	SOUR	gedcompendium
2	HD	NAME	The ‘gedcompendium’ ecosystem of packages for the R language
2	HD	CORP	Jamie Lendrum
3	HD	ADDR
3	HD	EMAIL	[email protected]
3	HD	WWW	https://jl5000.github.io/tidyged/
1	HD	DATE	22 NOV 2024
1	HD	LANG	English
1	HD	SUBM	@U1@
0	@U1@	SUBM
1	@U1@	NAME	Me
0	TR	TRLR
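
To make concrete what removing a change date structure means in the tidy format shown above, here is a minimal base R sketch. It works on a plain data frame with the same four columns and drops each CHAN row together with any deeper rows directly beneath it. The helper name drop_chan_rows() and the cut-down data are hypothetical, and this is not necessarily how remove_change_dates() is implemented; it only illustrates the idea.

```r
# Hypothetical helper: drop every CHAN row and the rows nested beneath it.
# Assumes a data frame in file order with an integer `level` column and a
# character `tag` column, as in the tables above.
drop_chan_rows <- function(df) {
  keep <- rep(TRUE, nrow(df))
  in_chan <- FALSE
  chan_level <- NA_integer_
  for (i in seq_len(nrow(df))) {
    if (in_chan && df$level[i] > chan_level) {
      keep[i] <- FALSE          # still inside the CHAN structure
      next
    }
    in_chan <- FALSE
    if (df$tag[i] == "CHAN") {
      keep[i] <- FALSE          # the CHAN line itself
      in_chan <- TRUE
      chan_level <- df$level[i]
    }
  }
  df[keep, , drop = FALSE]
}

# A cut-down version of the submitter record shown earlier
subm <- data.frame(
  level  = c(0L, 1L, 1L, 2L, 0L),
  record = c("@U1@", "@U1@", "@U1@", "@U1@", "TR"),
  tag    = c("SUBM", "NAME", "CHAN", "DATE", "TRLR"),
  value  = c("", "Me", "", "22 NOV 2024", ""),
  stringsAsFactors = FALSE
)

slim <- drop_chan_rows(subm)
nrow(subm) - nrow(slim)  # rows saved by dropping the change date
#> [1] 2
```

Two rows per record is small in isolation, but it scales linearly with the number of records in the file.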

Unreferenced records

Records that are not referenced by any other record can be found with the identify_unused_records() function. In the example below we create six family group records, half with members and half without, along with an unreferenced Repository record:

some_unref <- gedcom(subm("Me")) |> 
  add_indi(qn = "Tom Smith") |> 
  add_indi(qn = "Tammy Smith") |> 
  add_indi(qn = "Alice White") |> 
  add_indi(qn = "Phil Brown")
#> Added Unknown Individual: @I1@
#> Added Unknown Individual: @I2@
#> Added Unknown Individual: @I3@
#> Added Unknown Individual: @I4@

tom_xref <- find_indi_name(some_unref, "Tom")
tammy_xref <- find_indi_name(some_unref, "Tammy")
phil_xref <- find_indi_name(some_unref, "Phil")
alice_xref <- find_indi_name(some_unref, "Alice")

some_unref <- some_unref |>
  add_famg(husband = tom_xref, wife = tammy_xref) |> 
  add_famg() |> 
  add_famg(husband = phil_xref) |> 
  add_famg() |> 
  add_famg(children = alice_xref) |> 
  add_famg() |> 
  add_repo("Test repo") 
#> Added Family Group: @F1@
#> Added Family Group: @F2@
#> Added Family Group: @F3@
#> Added Family Group: @F4@
#> Added Family Group: @F5@
#> Added Family Group: @F6@
#> Added Repository: @R1@
  
identify_unused_records(some_unref)
#> [1] "@F2@" "@F4@" "@F6@" "@R1@"

We can find out more about these xrefs by using the describe_records() function from the tidyged package:

identify_unused_records(some_unref) |> 
  describe_records(gedcom = some_unref)
#> [1] "Family @F2@, headed by no individuals, and no children"
#> [2] "Family @F4@, headed by no individuals, and no children"
#> [3] "Family @F6@, headed by no individuals, and no children"
#> [4] "Repository @R1@, Test repo"
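
The idea behind finding unreferenced records can be sketched in base R: a record counts as unused if its xref never appears as a value on a line belonging to a different record. The helper find_unreferenced() and the toy data below are hypothetical, and identify_unused_records() may apply additional rules, so treat this as an illustration of the principle only.

```r
# Hypothetical helper: xrefs of records that no other record points to.
# Assumes a data frame with `level`, `record`, `tag` and `value` columns.
find_unreferenced <- function(df) {
  # Level 0 lines whose record identifier looks like an xref, e.g. "@F1@"
  xrefs <- df$record[df$level == 0 & grepl("^@.+@$", df$record)]
  is_used <- vapply(xrefs, function(x) {
    # Used if some other record carries this xref as a value
    any(df$value == x & df$record != x)
  }, logical(1))
  unname(xrefs[!is_used])
}

# Toy data: @F1@ has a husband, @F2@ has no members
toy <- data.frame(
  level  = c(0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L),
  record = c("HD", "HD", "@U1@", "@I1@", "@I1@", "@F1@", "@F1@", "@F2@", "TR"),
  tag    = c("HEAD", "SUBM", "SUBM", "INDI", "FAMS", "FAM", "HUSB", "FAM", "TRLR"),
  value  = c("", "@U1@", "", "", "@F1@", "", "@I1@", "", ""),
  stringsAsFactors = FALSE
)

find_unreferenced(toy)
#> [1] "@F2@"
```

The submitter @U1@ is not flagged here because the header points to it, which matches the behaviour seen in the output above.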
