A sensible data model for sharing genealogical and historical information.

Last updated 2016-05-08.


XARK is an extensible, open platform for archiving and sharing genealogical or historical data. Its goal is to create a standard to supplant GEDCOM without some of the nomenclature issues and limitations of GEDCOM X, and without the extreme complexity of alternatives like STEMMA.

As an amateur genealogist and a professional software developer, I see a need for a replacement for the existing standards that supports a more rigorous research process. I joined the discussions at FHISO, but its members are hopelessly lost in minutiae and incapable of creating a cogent standard. So, this project is a distillation of my ideas, a way to keep track of where I hope groups like FHISO and FamilySearch will go with their own efforts.

XARK is not concerned with backward compatibility with extant standards like GEDCOM. If the concepts align enough for conversion back to an earlier or alternative format, great, but XARK's concepts will not be limited to those that fit well with legacy formats.


An XARK file is an XML file containing one or more Source nodes.

Source Nodes

A Source node represents all knowledge about a single source of genealogical information -- a book, document, personal story, web site, etc. Sources can be derived from one or more other Sources. Sources can be referenced from one another locally (within the XARK file) or at a remote URL. Sources are the "containers" for Subject, Assertion, Event, and Note nodes. They do NOT contain other Sources, even though they can be related to other Sources.
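The containment rules above can be illustrated with a minimal Python sketch. It is not part of the XARK format: the class names, field names, and IDs are all hypothetical, and the child nodes are left as plain dicts. It only shows Sources holding their own Subjects, Assertions, Events, and Notes, and relating to other Sources by reference rather than by nesting.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:
    """One source of genealogical information (hypothetical field names)."""
    id: str
    title: str
    # IDs or URLs of the Sources this one is derived from: references, not nesting.
    derived_from: List[str] = field(default_factory=list)
    subjects: List[dict] = field(default_factory=list)
    assertions: List[dict] = field(default_factory=list)
    events: List[dict] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


@dataclass
class XarkFile:
    """A flat list of one or more Sources."""
    sources: List[Source]

    def __post_init__(self) -> None:
        if not self.sources:
            raise ValueError("an XARK file needs at least one Source")

    def resolve_local(self, source_id: str) -> Source:
        """Look up a Source in this file by its ID."""
        for source in self.sources:
            if source.id == source_id:
                return source
        raise KeyError(source_id)


# A document Source and a conclusions Source derived from it.
census = Source(id="s1", title="1900 census page")
tree = Source(id="s2", title="My conclusions", derived_from=["s1"])
xark = XarkFile(sources=[census, tree])
```

Here `xark.resolve_local(tree.derived_from[0])` returns the census Source. A remote reference would be a URL in `derived_from` instead of a local ID, which this sketch does not try to resolve.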

A single XARK file will usually contain many Source nodes. First, it will generally have one Source node for each document of interest, containing information extracted from the document with minimal interpretation (this is consistent with the overall concept of "persona"-based research). Second, the XARK file will generally contain one or more Source nodes representing the genealogist's own synthesis of the data within those documents (i.e., their "conclusions").

A key concept here is that a Source can represent either primary information (transcribed/extracted directly from a source) or secondary information (conclusions reached by some researcher in their own family tree), and as such, a researcher can reference another researcher's conclusions.

Even for a single researcher, it may be beneficial to maintain several Source nodes within a single XARK file, say, one for each family of interest. For example, my wife's family and mine are separate research interests for me, but I would want to maintain them in a single file. I may want to do the same for my maternal and paternal lines.

The Source nodes are where all bibliographic information should be stored.

Subject Nodes

Each Subject node has two required pieces of information, and one optional one:

Assertion Nodes

Assertions are the primary mechanism for decorating Subjects with properties and relationships. Every assertion has a primary Subject, a Type, and a Value.

(I toyed with the idea of not requiring a Value, but in every case I could think of, removing the Value would require more enumerated Types. I think it is better for Types to be broad enough that a Value is always required. For example, rather than creating one Type for "Dead" and another for "Alive," it is better to have a Type "Living Status" with those two value choices.

The only exception to this is where the Value is NOT YET CREATED. For example, you might know someone was married but not know their spouse's name, so you may not have created a Subject for the spouse yet. This is acceptable, but the Assertion should be flagged in software as incomplete, awaiting resolution.)

There are three categories of Assertions: Facts, Relationships, and Roles. A Fact links a Subject to a standalone value. A Relationship links a Subject to another Subject. A Role links a Subject to an Event.
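One way to picture the three categories is as a function of what the Value points at. The sketch below is only an illustration under stated assumptions: the `Category` enum, the `categorize` helper, and the URN strings are hypothetical, and it only recognizes Subject and Event references that are passed in explicitly.

```python
from enum import Enum


class Category(Enum):
    FACT = "fact"                  # Subject -> standalone value
    RELATIONSHIP = "relationship"  # Subject -> another Subject
    ROLE = "role"                  # Subject -> Event


def categorize(value, subject_refs, event_refs):
    """Classify an assertion by what its value refers to.

    subject_refs and event_refs are the known Subject and Event
    references (local IDs or remote URLs) for the file being examined.
    """
    if value in subject_refs:
        return Category.RELATIONSHIP
    if value in event_refs:
        return Category.ROLE
    return Category.FACT
```

For example, with `subject_refs = {"urn:xark:subject:2"}` and `event_refs = {"urn:xark:event:9"}`, a Value of `"urn:xark:event:9"` is a Role, while a Value of `"Blacksmith"` is a Fact.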

The Subject is the XARK ID of a local Subject node. The local Subject may in turn refer to a remote one, but all Assertion references must be to the local ID.

The Type is a URI describing how the Subject relates to the Value. This will generally represent a type of fact, relationship, or role.

The Value is the thing being asserted. This can be a number, text value, a URN to a local Subject or Event ID, or a URL (including but not limited to a link to a remote Subject or Event). Values themselves should *NOT* contain a certainty, since all assertions can have an optional certainty. It is important to note that values are NEVER dates. Period. Any assertion that involves a Date should instead link to an Event, which gives the assertion a geospatial and temporal context.
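The Subject/Type/Value triple and the "never a date" rule can be sketched as a small Python class. This is a hedged illustration, not a defined API: the `Assertion` class, its field names, and the example URI are hypothetical, and a missing Value is modeled as `None` to stand for "not yet created".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Union

# Numbers, text, URNs, and URLs all fit in these types.
Value = Union[int, float, str]


@dataclass
class Assertion:
    subject: str            # ID of a local Subject node
    type: str               # URI describing how the Subject relates to the Value
    value: Optional[Value]  # None means the Value is not yet created

    def __post_init__(self) -> None:
        # datetime is a subclass of date, so this check covers both.
        if isinstance(self.value, date):
            raise TypeError("values are never dates; link to an Event instead")

    @property
    def incomplete(self) -> bool:
        """True when the assertion is still awaiting a Value."""
        return self.value is None


# Someone known to be married, spouse not yet identified.
pending = Assertion("p1", "https://example.org/xark/spouse", None)
```

Software could use `pending.incomplete` to surface assertions that still need resolution. A date written as text would slip past this type check, so a real validator would need more than this.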

There are also some optional fields:


An optional Certainty value can be assigned for an assertion, representing how the researcher feels about the assertion's validity. Possible values are False, Probably False, Possibly, Probably True, and True. Without getting pedantic, True should be used when the assertion is reasonably certain, and Probably True should be used when the source information itself (rather than other, supposedly conflicting sources) leaves reasonable doubt that the assertion is true.
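The five Certainty values map naturally onto an ordered enumeration. The sketch below assumes the order in which they are listed above is a ranking from least to most certain; that ordering, the `Certainty` name, and the `parse_certainty` helper are assumptions for illustration only.

```python
from enum import IntEnum
from typing import Optional


class Certainty(IntEnum):
    FALSE = 0
    PROBABLY_FALSE = 1
    POSSIBLY = 2
    PROBABLY_TRUE = 3
    TRUE = 4


def parse_certainty(label: Optional[str]) -> Optional[Certainty]:
    """Turn a label such as "Probably True" into a Certainty.

    Certainty is optional, so a missing label stays None.
    """
    if label is None:
        return None
    return Certainty[label.strip().upper().replace(" ", "_")]
```

With an ordered enum, a comparison such as `parse_certainty("Probably True") >= Certainty.POSSIBLY` can be used to filter out assertions the researcher considers unlikely.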

Note: the exact way that Events and Assertions are related is still in flux. This is one of the hardest problems to solve in genealogical data modeling.

Associated Events

When an Assertion does not represent an Event Role, the assertion may have links to Events that give the assertion a geographic and temporal context. For example, a marriage relationship Assertion could be linked to various events related directly to that marriage -- the ceremony, the reception, the engagement announcement, a divorce proceeding, etc.

In other words, an Assertion should only be linked to Events directly involved in that assertion, not merely those that support or mention the relationship. For example, one should not link a Census event to a marriage, even if the census provides proof of marriage.

Each linked Event should provide a Context value that relates the event's timeframe to the assertion. The allowed values are Before, Start, During, End, Around, and After. For example, the event of a marriage ceremony would Start a relationship between the spouses, and a divorce event would End the relationship. (It is probably not a good practice to use Death events to end marriages, as ALL relationships theoretically end upon death.)

The same Event can be linked to many Assertions. For example, the same Birth event may link to assertions for the child's birth and their relationships to their parents. This is why the Role is important. However, as above, it should not be linked to the parents' marriage assertion.
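Since this part of the model is still in flux, the following is only a rough Python sketch of linked Events and their Context values. The class names, IDs, and example URIs are hypothetical, and using Start for the birth event on a parent relationship is my assumption, not something the model prescribes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Context(Enum):
    BEFORE = "Before"
    START = "Start"
    DURING = "During"
    END = "End"
    AROUND = "Around"
    AFTER = "After"


@dataclass
class EventLink:
    event_id: str     # ID of the linked Event
    context: Context  # how the event's timeframe relates to the assertion


@dataclass
class LinkedAssertion:
    subject: str
    type: str
    value: str
    events: List[EventLink] = field(default_factory=list)

    def events_with(self, context: Context) -> List[str]:
        """Return the IDs of linked Events that have the given Context."""
        return [link.event_id for link in self.events if link.context is context]


# A marriage relationship started by the ceremony and ended by a divorce.
marriage = LinkedAssertion(
    "p1", "https://example.org/xark/spouse", "p2",
    [EventLink("e-ceremony", Context.START), EventLink("e-divorce", Context.END)],
)

# The same birth event gives context to two parent relationships,
# but it is deliberately not linked to the parents' marriage.
child_of_mother = LinkedAssertion(
    "p3", "https://example.org/xark/parent", "p1",
    [EventLink("e-birth", Context.START)],
)
child_of_father = LinkedAssertion(
    "p3", "https://example.org/xark/parent", "p2",
    [EventLink("e-birth", Context.START)],
)
```

Here `marriage.events_with(Context.END)` returns `["e-divorce"]`, and both parent assertions share the single `e-birth` event.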

Privacy Policy

Every assertion may have an optional privacy policy value. I'm still working out this concept. It could be as simple as a flag stating whether the assertion should be published or not, which would allow easy creation of XARK files that could be shared or published online without, say, compromising sensitive information about living individuals.
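In its simplest form, the flag idea could look like the sketch below. Everything here is an assumption, since the concept is not settled: assertions are plain dicts, the key is named `private`, and an unflagged assertion is treated as publishable.

```python
from typing import Dict, List


def publishable(assertions: List[Dict]) -> List[Dict]:
    """Keep only the assertions that are not flagged as private."""
    return [a for a in assertions if not a.get("private", False)]


assertions = [
    {"subject": "p1", "type": "https://example.org/xark/occupation", "value": "Blacksmith"},
    {"subject": "p7", "type": "https://example.org/xark/residence", "value": "Home address", "private": True},
]
shareable = publishable(assertions)
```

A more cautious design might default to private and require an explicit opt-in before anything about a living individual is published.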

There is absolutely NO automatic linkage between assertions of a relationship and assertions of an event role. Being the "spouse" on a marriage event does NOT automatically create a marriage relationship assertion between the spouses, nor vice versa.

The reason for this is that managing these links would be far too complex for software and for the data model. The purpose of the Event Role is often to associate people who have a *transient* relationship to other people for the purposes of that event, or to the event itself. For example, "photographer" is a role someone plays at an event; it isn't a lifelong relationship between the spouses of a wedding and their wedding photographer.


All nodes within each Source node can be serialized as XMP and embedded within image resources.

What's with the name?

I came up with the name when Sam Ruby et al. were seeking names for the standard to replace RSS. When I suggested the name, I reserved the domain just in case. The name "Atom" was ultimately chosen for that project, but I kept the domain and have been looking for a good use for it. This new project seemed to be an excellent fit.


Richard Tallent
Twitter and just about everywhere else: richardtallent