Last updated 2016-05-08.
XARK is an extensible, open platform for archiving and sharing genealogical or historical data. Its goal is to create a standard to supplant GEDCOM without some of the nomenclature issues and limitations of GEDCOM X, and without the extreme complexity of alternatives like STEMMA.
As an amateur genealogist and a professional software developer, I see a need for a replacement for the existing standards that supports a more rigorous research process. I joined the discussions at FHISO, but its members are hopelessly lost in minutiae and incapable of creating a cogent standard. So, this project is a distillation of my ideas, a way to keep track of where I hope groups like FHISO and FamilySearch will go with their own efforts.
XARK is not concerned with backward compatibility with extant standards like GEDCOM. If the concepts align enough for conversion back to an earlier or alternative format, great, but XARK's concepts will not be limited to those that fit well with legacy formats.
An XARK file is an XML file containing one or more Source nodes.
A Source node represents all knowledge about a single source of genealogical information -- a book, document, personal story, web site, etc. Sources can be derived from one or more other Sources. Sources can be referenced from one another locally (within the XARK file) or at a remote URL. Sources are the "containers" for Subject, Assertion, Event, and Note nodes. They do NOT contain other Sources, even though they can be related to other Sources.
A single XARK file will usually contain many Source nodes. First, it will generally have one Source node for each document of interest, containing information extracted from the document with minimal interpretation (this is consistent with the overall concept of "persona"-based research). Second, the XARK file will generally contain one or more Source nodes representing the genealogist's own synthesis of the data within those documents (i.e., their "conclusions").
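As a rough illustration of that layout, here is a minimal Python sketch that builds a file with one document Source and one "conclusions" Source derived from it. XARK does not define element or attribute names in this document, so `xark`, `Source`, `Title`, `DerivedFrom`, `id`, and `ref` are all hypothetical, as are the IDs and titles.

```python
import xml.etree.ElementTree as ET

# All element and attribute names below are hypothetical placeholders;
# this document does not define XARK's actual XML vocabulary.
root = ET.Element("xark")

# One Source per document of interest, holding minimally interpreted extracts.
document = ET.SubElement(root, "Source", id="src-document-1")
ET.SubElement(document, "Title").text = "Example census page"

# A separate Source for the researcher's own synthesis ("conclusions").
conclusions = ET.SubElement(root, "Source", id="src-conclusions")
ET.SubElement(conclusions, "Title").text = "My conclusions"

# Sources refer to one another by reference; they are never nested.
ET.SubElement(conclusions, "DerivedFrom", ref="src-document-1")

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The point of the sketch is the shape: both Source nodes are siblings under the root, and the derivation is expressed as a reference rather than by containment.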
A key concept here is that a Source can represent either primary information (transcribed/extracted directly from a source) or secondary information (conclusions reached by some researcher in their own family tree), and as such, a researcher can reference another researcher's conclusions.
Even for a single researcher, it may be beneficial to maintain several Source nodes within a single XARK file, say, one for each family of interest. For example, my wife's family and mine are separate research interests for me, but I would want to maintain them in a single file. I may want to do the same for my maternal and paternal lines.
The Source nodes are where all bibliographic information should be stored.
Each Subject node has two required pieces of information, and one optional one:
- ID: (required) an XARK GUID that will be used to refer to this subject.
- Type: (required) a URI representing the type of subject. URIs will be provided for the following standard types, but this can be extended to support others in the future:
  - Place (geographic, geopolitical, or physical address / structure)
  - Organization (corporate, academic, religious, etc.)
- ParentSubject: (optional) a URN to a Subject within another Source in the same XARK file, or a remote URL.
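The three fields above could be modeled along these lines. This is only a sketch under assumptions: the type URIs are invented `example.org` placeholders, the "XARK GUID" format is not specified here so a random UUID stands in for it, and `Subject`, `new_subject`, and the remote URL are hypothetical names chosen for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical type URIs; the real XARK URIs are not given in this document.
PLACE = "http://example.org/xark/types/Place"
ORGANIZATION = "http://example.org/xark/types/Organization"


@dataclass(frozen=True)
class Subject:
    id: str                               # required: identifier used to refer to this subject
    type: str                             # required: URI naming the kind of subject
    parent_subject: Optional[str] = None  # optional: URN of a local Subject, or a remote URL


def new_subject(type_uri: str, parent_subject: Optional[str] = None) -> Subject:
    """Create a Subject with a fresh ID (a UUID4 here, as an assumption)."""
    if not type_uri:
        raise ValueError("Type is required")
    return Subject(id=str(uuid.uuid4()), type=type_uri, parent_subject=parent_subject)


# A Place that points at a remote description of the same place.
county = new_subject(PLACE, parent_subject="https://example.org/places/some-county")
# An Organization with no parent link.
parish = new_subject(ORGANIZATION)
```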
The benefit of creating a Subject node linking to a Subject in another Source in the same file is that it easily allows splitting up research of families that have common people. For example, if I maintain a separate Source in my file for my family and my wife's, I can create our common family members (each other, our children, and distant shared relatives) in one Source and create a link to them from the other.
The benefit of creating a Subject node linking to a remote Subject is that I can essentially include research from other researchers in my own family history without having to duplicate that information. Keep in mind that a Subject can be a Place, Organization, etc., not just a Person, so there is ample opportunity for individual researchers, libraries, organizations, etc. to create shared information of interest to many researchers.
A remote Subject URL may resolve to a Subject entity within another XARK file, but it is not required to do so. I can, for example, create a Subject representing a county and link it to the Wikipedia page for that county. Certainly, linking to a solid remote XARK file is preferable, but requiring it would handicap the usefulness of having a remote Subject until virtually all genealogical information about those subjects is also in XARK format, and I'm not willing to assume 100% adoption.
Assertions are the primary mechanism for decorating Subjects with properties and relationships. Every assertion has a primary Subject, a Type, and a Value.
(I toyed with the idea of not requiring a Value, but in all cases I could think of, removing the value requires more enumerated Types, so it is better, I think, for Types to be broad enough that there is always a required value. For example, rather than creating one Type for "Dead" and another for "Alive," it is better to have a type "Living Status" with those two value choices.
The only exception to this is where the value is NOT YET CREATED. For example, you might know someone was married but not know their name, so you may not create a Subject for them yet. This is acceptable, but the Assertion should be flagged in software as incomplete, awaiting resolution.)
There are three categories of Assertions: Facts, Relationships, and Roles. A Fact links a Subject to a standalone value. A Relationship links a Subject to another Subject. A Role links a Subject to an Event.
The Subject is the XARK ID of a local Subject node. The local Subject may in turn refer to a remote one, but all Assertion references must be to the local ID.
The Type is a URI describing how the Subject relates to the Value. This generally will represent either a type of fact, relationship, or role.
The Value is the thing being asserted. This can be a number, text value, a URN to a local Subject or Event ID, or a URL (including but not limited to a link to a remote Subject or Event). Values themselves should *NOT* contain a certainty, since all assertions can have an optional certainty. It is important to note that values are NEVER dates. Period. Any assertion that involves a Date should instead link to an Event, which gives the assertion a geospatial and temporal context.
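To make the Subject / Type / Value triple concrete, here is a small Python sketch. The class and function names, the type URIs, and the IDs are all hypothetical. Classifying by whether the value matches a local Subject or Event ID is a simplification of my own: it ignores remote URLs that point to Subjects or Events.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Set, Union

Value = Union[str, int, float]


@dataclass
class Assertion:
    subject: str            # ID of a local Subject
    type: str               # URI describing how the Subject relates to the Value
    value: Optional[Value]  # None only while the value has not been created yet

    def __post_init__(self) -> None:
        # Values are never dates; a dated assertion links to an Event instead.
        if isinstance(self.value, date):
            raise TypeError("values are never dates; link to an Event instead")

    @property
    def incomplete(self) -> bool:
        """True when the assertion is still awaiting its value."""
        return self.value is None


def category(a: Assertion, subject_ids: Set[str], event_ids: Set[str]) -> Optional[str]:
    """Classify as Fact, Relationship, or Role (None if the value is missing)."""
    if a.incomplete:
        return None  # cannot be classified until the value exists
    if a.value in subject_ids:
        return "Relationship"  # Subject -> Subject
    if a.value in event_ids:
        return "Role"          # Subject -> Event
    return "Fact"              # Subject -> standalone value


subjects = {"subj-a", "subj-b"}
events = {"evt-wedding"}
spouse = Assertion("subj-a", "http://example.org/xark/types/SpouseOf", "subj-b")
groom = Assertion("subj-a", "http://example.org/xark/types/Groom", "evt-wedding")
status = Assertion("subj-a", "http://example.org/xark/types/LivingStatus", "Dead")
unknown_spouse = Assertion("subj-b", "http://example.org/xark/types/SpouseOf", None)
```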
There are also some optional fields:
An optional Certainty value can be assigned to an assertion, representing how the researcher feels about the assertion's validity. Possible values are False, Probably False, Possibly, Probably True, and True. Without getting pedantic, True should be used when the assertion is reasonably certain, and Probably True should be used when the source information itself leaves reasonable doubt that the assertion is true (as opposed to doubt raised by other, supposedly conflicting sources).
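The five Certainty values map naturally onto an enumeration. The sketch below is one possible encoding, not part of XARK: the numeric ranks (0 to 4) and the helper `parse_certainty` are assumptions made so that values can be compared and read back from text.

```python
from enum import IntEnum


class Certainty(IntEnum):
    # The numeric ranks are an assumption; the document only names the values.
    FALSE = 0
    PROBABLY_FALSE = 1
    POSSIBLY = 2
    PROBABLY_TRUE = 3
    TRUE = 4

    @property
    def label(self) -> str:
        """Human-readable form, e.g. 'Probably False'."""
        return self.name.replace("_", " ").title()


def parse_certainty(text: str) -> Certainty:
    """Turn a label such as 'Probably True' into a Certainty member.

    Raises KeyError for anything outside the five allowed values.
    """
    return Certainty[text.strip().upper().replace(" ", "_")]
```

Using an ordered enum lets software sort or filter assertions by certainty, for example keeping only those at `Certainty.PROBABLY_TRUE` or above.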
Note: the exact way that Events and Assertions are related is still in flux. This is one of the hardest problems to solve in genealogical data modeling.
When an Assertion does not represent an Event Role, the assertion may have links to Events that give the assertion a geographic and temporal context. For example, a marriage relationship Assertion could be linked to various events related directly to that marriage -- the ceremony, the reception, the engagement announcement, a divorce proceeding, etc.
In other words, an Assertion should only be linked to Events directly involved in that assertion, not merely those that support or mention the relationship. For example, one should not link a Census event to a marriage, even if the census provides proof of marriage.
Each linked Event should provide a Context value that relates the event's timeframe to the assertion. The allowed values are Before, Start, During, End, Around, and After. For example, the event of a marriage ceremony would Start a relationship between the spouses, and a divorce event would End the relationship. (It is probably not a good practice to use Death events to end marriages, as ALL relationships theoretically end upon death.)
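Since this part of the model is still in flux, the following is only a sketch of how Context-tagged Event links might hang off an Assertion. `EventLink`, the `events` field, `events_with`, the type URI, and all IDs are hypothetical names used for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Context(Enum):
    """How a linked Event's timeframe relates to the assertion."""
    BEFORE = "Before"
    START = "Start"
    DURING = "During"
    END = "End"
    AROUND = "Around"
    AFTER = "After"


@dataclass
class EventLink:
    event_id: str     # ID of a local Event
    context: Context  # relation of the event's timeframe to the assertion


@dataclass
class Assertion:
    subject: str
    type: str
    value: str
    events: List[EventLink] = field(default_factory=list)


def events_with(a: Assertion, context: Context) -> List[str]:
    """Return the IDs of linked Events that carry the given Context."""
    return [link.event_id for link in a.events if link.context is context]


# A marriage relationship that a ceremony starts and a divorce ends.
marriage = Assertion("subj-a", "http://example.org/xark/types/SpouseOf", "subj-b")
marriage.events.append(EventLink("evt-ceremony", Context.START))
marriage.events.append(EventLink("evt-reception", Context.DURING))
marriage.events.append(EventLink("evt-divorce", Context.END))
```

Here `events_with(marriage, Context.START)` yields the ceremony, and `events_with(marriage, Context.END)` yields the divorce.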
The same Event can be linked to many Assertions. For example, the same Birth event may link to assertions for the child's birth and their relationships to their parents. This is why the Role is important. However, as above, it should not be linked to the parents' marriage assertion.
There is absolutely NO automatic linkage between assertions of a relationship and assertions of an event role. Being the "spouse" on a marriage event does NOT automatically create a marriage relationship assertion between the spouses, nor vice versa.
The reason for this is that managing these links would be far too complex for software and for the data model. The purpose of the Event Role is often to associate people who have a *transient* relationship to other people for the purposes of that event, or to the event itself. For example, "photographer" is a role someone plays at an event; it isn't a lifelong relationship between the spouses of a wedding and their wedding photographer.
All nodes within each Source node can be serialized as XMP and embedded within image resources.
What's with the name?
I came up with the name when Sam Ruby et al. were seeking names for the standard to replace RSS. When I suggested the name, I reserved the domain just in case. The name "Atom" was ultimately chosen for that project, but I kept the domain and have been looking for a good use for it. This new project seemed to be an excellent fit.
Twitter and just about everywhere else: richardtallent