OCHRE Database
A description of the logical schema and querying mechanism of the efficient and scalable graph database that powers the OCHRE platform.
Representing Hierarchical Data in a Semistructured Graph Database
This web page updates the discussion of the topics below that were previously discussed in OCHRE: An Online Cultural and Historical Research Environment (Eisenbrauns, 2012) by David Schloen and Sandra Schloen. For detailed examples of a wide range of use cases, see Database Computing for Scholarly Research: Case Studies Using the Online Cultural and Historical Research Environment by Sandra Schloen and Miller Prosser (Springer, forthcoming).
The semistructured data model can be seen as a variant of the graph data model. The web-like “network” structure of a typical graph database has few constraints on how entities are related. This provides great flexibility but makes querying inefficient and semantically ambiguous. In contrast, a semistructured database harnesses the power of “tree” structures to represent hierarchical relations among entities (a tree is a kind of graph). A semistructured database can easily represent information as open-ended hierarchies of entities while also allowing cross-hierarchy links between entities in different trees to create a network of relations. A network-graph database could represent the same information but would be much less efficient when working with hierarchically structured data.
For discussions of the semistructured data model and XML, see Database Systems: The Complete Book, 2d edition, by H. Garcia-Molina, J. D. Ullman, and J. Widom (Upper Saddle River, NJ: Pearson Prentice Hall, 2009), pp. 483–515; and Dan Suciu, “Semi-structured Data Model,” in Encyclopedia of Database Systems, ed. Ling Liu and M. Tamer Özsu (Springer Link, 27 January 2017).
Semistructured databases are well suited for data that is already hierarchically structured. Such data is ubiquitous in the study of languages, texts, artifacts, and other cultural materials. Tree hierarchies very compactly represent parthood and group membership and make it possible for software developers to exploit the power of recursion, which is used extensively in the OCHRE software. Moreover, cross-hierarchy links from one tree to another make it possible to represent information as a web-like network graph, wherever appropriate, bypassing the constraints of the tree structure.
For example, relations of spatial containment are easily represented by means of recursive parthood hierarchies (see the discussion of Tree database items below) such that a spatially situated unit of observation contains smaller spatial units and these in turn contain still smaller spatial units, and so on down the hierarchy (e.g., an archaeological site contains many soil layers that each contain many artifacts). Likewise, it is easy to see how temporal, linguistic, and textual entities and relations can be modeled as recursive parthood hierarchies in which smaller entities are nested within larger entities of the same kind. Other kinds of entities and relations lend themselves to non-recursive grouping hierarchies that represent group membership rather than part-whole relations. For example, grouping hierarchies of historical persons can organize persons in groups and sub-groups without implying that one person is part of another.
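The recursive character of a parthood hierarchy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SpatialUnit class, its field names, and the sample site are hypothetical and do not reflect OCHRE's actual data structures or code.

```python
from dataclasses import dataclass, field


@dataclass
class SpatialUnit:
    """Hypothetical stand-in for a spatially situated unit of observation."""
    name: str
    parts: list["SpatialUnit"] = field(default_factory=list)


def descendants(unit: SpatialUnit) -> list[str]:
    """Recursively collect the names of all units contained in `unit`."""
    names = []
    for part in unit.parts:
        names.append(part.name)
        names.extend(descendants(part))  # the same function handles every level
    return names


def depth(unit: SpatialUnit) -> int:
    """Number of levels in the containment hierarchy rooted at `unit`."""
    return 1 + max((depth(part) for part in unit.parts), default=0)


# Illustrative data: a site contains layers, which contain artifacts.
site = SpatialUnit("Site", [
    SpatialUnit("Layer 1", [SpatialUnit("Pottery sherd"), SpatialUnit("Bead")]),
    SpatialUnit("Layer 2", [SpatialUnit("Coin")]),
])

print(descendants(site))
print(depth(site))
```

Because every level of the hierarchy holds entities of the same kind, one short recursive function works no matter how deeply the units are nested.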
In addition to hierarchical tree structures, semistructured databases can accommodate unstructured networks of entities because they include not just hierarchies but also cross-hierarchy links. They can also represent highly structured information, e.g., tables of data organized in rows and columns. This flexibility explains why XML and JSON, as notations for semistructured data, have become the standard formats for exchanging information on the Web.
In contrast, the relational data model that underlies most commercial databases is clearly the best model for highly structured data but is cumbersome to use for hierarchical data. Spatial, temporal, linguistic, textual, and taxonomic hierarchies are easily represented and queried in a semistructured XML database but require many inefficient table joins in a relational database. Likewise, network-graph databases are well suited for unstructured data in which there are few constraints on how entities are related. However, they are less efficient and harder to program than semistructured databases when dealing with recursive hierarchical data, for which XML provides a compact notation and structural constraints that enable efficient querying.
The relational data model, the network-graph data model, and the semistructured graph data model are all universal data models that can be used to represent information of any kind. However, each model has advantages and disadvantages, depending on the kind of data one is working with. The foundational ontology of OCHRE and resulting database item classes (described below in the section on “Ontological Classes of Items in the OCHRE Database”) could be — and once were — implemented in a relational database using tuples (table rows) and SQL. They could also be implemented in a network-graph database using RDF and SPARQL, or as a labeled property graph (LPG). However, implementing the ontology in a semistructured graph database using XML Schema and XML Query (XQuery) is the best approach, in principle, and is conducive to crafting elegant code that is very compact and efficient, in practice.
The OCHRE Database Uses XML Schema and XQuery Instead of SQL
The semistructured graph database on the back end of the OCHRE platform is based on the XML Schema and XML Query (XQuery) standards of the World Wide Web Consortium (W3C). The database runs on Tamino XML Server from Software AG, an enterprise-class native-XML database management system (DBMS) that has a highly optimized XQuery processor and supports indexing on the “ancestor axis” for efficient searching of hierarchically organized data. The OCHRE database is a transactional multi-user database with password-protected user accounts, record-locking, and mechanisms to ensure data security and disaster recovery. It is a professionally engineered and highly scalable database that meets the ACID requirements of atomicity, consistency, isolation, and durability.
Information in the OCHRE database is stored in a large number of Extensible Markup Language (XML) “documents” that conform to various predefined document types. These XML documents serve as the atomic keyed-and-indexed data objects in the database, or what we prefer to call “items,” as explained below. Each XML document has an internal element containing a universally unique identifier (UUID) that functions as its database key. The database maintains indexes on the keys of all the documents to ensure efficient querying and joining of data via XQuery. Links between database items are established by adding the UUID key of the target document to an element of the source document.
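The keying and linking mechanism just described can be illustrated with Python's standard library. This is a minimal sketch, not OCHRE's schema: the element names (uuid, name, links, link) and the document types used here are assumptions made for the example, and a plain dictionary stands in for the database's key index.

```python
import uuid
import xml.etree.ElementTree as ET


def make_item(doc_type: str, name: str) -> ET.Element:
    """Build a small XML document with an internal <uuid> key element."""
    item = ET.Element(doc_type)
    ET.SubElement(item, "uuid").text = str(uuid.uuid4())
    ET.SubElement(item, "name").text = name
    return item


def add_link(source: ET.Element, target: ET.Element) -> None:
    """Link source to target by storing the target's UUID in the source."""
    links = source.find("links")
    if links is None:
        links = ET.SubElement(source, "links")
    ET.SubElement(links, "link").text = target.findtext("uuid")


def resolve_links(source: ET.Element, index: dict[str, ET.Element]) -> list[ET.Element]:
    """Follow each stored UUID through the key index (a simple join)."""
    return [index[link.text] for link in source.iterfind("links/link")]


photo = make_item("resource", "Photo of sherd")
sherd = make_item("spatialUnit", "Pottery sherd")
add_link(sherd, photo)

# The dictionary maps each UUID key to its document, like a database key index.
key_index = {doc.findtext("uuid"): doc for doc in (photo, sherd)}

linked_names = [doc.findtext("name") for doc in resolve_links(sherd, key_index)]
print(ET.tostring(sherd, encoding="unicode"))
print(linked_names)
```

The source document stores only the target's key, so the target can be edited or re-linked without touching the documents that point to it.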
Note that XML documents need not correspond to real-world documents and in OCHRE’s highly atomized back-end database they rarely do (although this is not the case in the front-end publication database, which is simply a document store). The term “document” in this context reflects the fact that XML is a notation for serializing complex data structures as linear character strings using Unicode character codes (i.e., using one of the UTF encoding schemes). This means that any modern operating system that supports Unicode text files can handle data stored as XML documents. Since it was introduced in 1998, XML has become ubiquitous in computing because it provides a text-based and thus cross-platform notation for all kinds of data — structured, semistructured, and unstructured — and not just for documents as normally understood.
XML Schema is used to specify the elements and attributes of each XML document type in OCHRE’s back-end database. An XML Schema specification of the structure and sequence of an XML document’s elements and attributes is analogous to a relation schema for a relational database table, which defines the names, types, and sequence of the attributes (normally displayed as table column headings) that are associated with the attribute values in a tuple (normally displayed as a table row). The advantage of using XML Schema to specify the structure of the data objects in the OCHRE database is that their internal structures can be more complex than a tuple, i.e., they can contain internal tree structures represented by nested elements inside the XML document. This has many benefits for the kind of information stored in the OCHRE database.
The XML Query (XQuery) querying language is used to create, read, update, and delete the XML documents in the OCHRE database — the CRUD operations — and also to join them together into larger configurations based on their keys. The XML documents managed by a native-XML database are analogous to the tuples (table rows) in a relational database, i.e., they are the atomic data objects that are individually indexed and retrieved. Thus XQuery is analogous to the SQL querying language used in relational databases, although XQuery is a Turing-complete language, unlike SQL, and in principle is more powerful.
Relational tables and SQL are not well suited for semistructured data organized in open-ended hierarchies of entities whereas XML documents and XQuery are very well suited for such data. That is why XML Schema and XQuery were chosen to implement the OCHRE database, which makes extensive use of hierarchies to represent the parthood relations, class-subclass relations, and group-membership relations commonly found in the study of languages, texts, artifacts, and other cultural materials. This is explained further in the section below on “Ontological Classes of Items in the OCHRE Database.”
Ontological Classes of Items in the OCHRE Database
The OCHRE database contains millions of small XML documents that represent entities of interest in a highly granular fashion. Each XML document conforms to an XML document type and the schemas of the various document types constitute the logical schema of the OCHRE database. The document types correspond, in most cases, to ontological classes in the foundational ontology (meta-ontology) implemented in the database. (For technical reasons, some XML document types implement more than one class and some classes are implemented by more than one document type.)
The classes of items in the OCHRE database are sometimes called “categories,” especially in older documentation, but that term is used in OCHRE in a non-philosophical way and is not meant to suggest Aristotelian or Kantian categories such as substance, quantity, quality, relation, etc.
Instances (individuals) of the OCHRE ontological classes are called “items,” emphasizing the notion that these are items of interest that have been singled out by agents who name them and make statements about them. Items of interest to scholars include mental concepts and linguistic utterances as well as spatiotemporal entities. Each item of interest that someone has singled out for discussion, no matter how small, is represented in the database by an XML document of the appropriate document type, e.g., something as small as a punctuation mark or diacritical mark in a text may be an item of interest that is represented by an individual data object in the OCHRE database. The many atomized items in the database are combined into larger configurations, as needed, to provide different views of the data to end users.
The ontological classes of OCHRE database items are described below. The brief descriptions here can be supplemented by consulting OCHRE: An Online Cultural and Historical Research Environment (Eisenbrauns, 2012), which is out-of-date in some respects but still useful (and uses the term category instead of class). Please note that this book and the user interface for adding content to the OCHRE database employ different names for some classes of items to make them less abstract and easier to understand; for example, Agent items are called “Persons & organizations,” Spatial items are called “Locations & objects,” Temporal items are called “Periods,” and Attribute items are called “Variables.”
Two of the classes of database items described below have subclasses: the Attribute class and the Hierarchy class. The fact that a database item belongs to a subclass of a main class is represented in the database, not by a different XML document type, but by an internal element within the XML document. The subclass specified in this internal element triggers the appropriate behavior in the software for handling an item belonging to that subclass.
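The idea of selecting behavior from an internal element, rather than from the document type, can be sketched as follows. The element name subclass, the hierarchy root element, and the relation phrases are hypothetical choices for this illustration. Only the subclass names Parthood, Grouping, and Class come from the descriptions on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical relation phrases for three Hierarchy subclasses.
RELATION_BY_SUBCLASS = {
    "Parthood": "is part of",
    "Grouping": "is a member of",
    "Class": "is a subclass of",
}


def describe_relation(hierarchy: ET.Element, child: str, parent: str) -> str:
    """Choose the wording from the internal <subclass> element."""
    subclass = hierarchy.findtext("subclass")
    if subclass not in RELATION_BY_SUBCLASS:
        raise ValueError(f"unknown hierarchy subclass: {subclass!r}")
    return f"{child} {RELATION_BY_SUBCLASS[subclass]} {parent}"


# Both documents share one document type; only the internal element differs.
parthood = ET.fromstring("<hierarchy><subclass>Parthood</subclass></hierarchy>")
grouping = ET.fromstring("<hierarchy><subclass>Grouping</subclass></hierarchy>")

print(describe_relation(parthood, "Layer 1", "Site"))
print(describe_relation(grouping, "Scribe A", "Scribes"))
```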
A Project item represents a research project that controls the database items that have been added to the database by members of the project. All information stored in the OCHRE database is organized by projects. Each project’s director or designated project administrator determines who can view or edit the project’s data, which remains invisible to people who have not been given access to it.
All database items contain the unique identifier (database key) of the Project item to which they belong. There is one and only one Project item for each project. There is an overarching “OCHRE” Project item that owns all the other Project items. This top-level OCHRE project has its own database items, which are automatically inherited by other projects and visible to them, if they wish to use them. For example, the OCHRE project has taxonomic Attribute items and Value items, and also Concept items, for standardized terms and concepts that are shared by multiple projects (see below).
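The inheritance of top-level items can be pictured as a simple filter on the project key that every item carries. The sketch below is an assumption-laden illustration: the Item class, the key strings, and the visible_items function are invented for the example and say nothing about how OCHRE actually enforces access.

```python
from dataclasses import dataclass

OCHRE_PROJECT = "ochre-root"  # hypothetical key of the top-level OCHRE project


@dataclass(frozen=True)
class Item:
    key: str
    project_key: str  # key of the Project item that owns this item
    name: str


def visible_items(items: list[Item], project_key: str) -> list[Item]:
    """Items owned by the project plus those inherited from the top level."""
    allowed = {project_key, OCHRE_PROJECT}
    return [item for item in items if item.project_key in allowed]


items = [
    Item("u1", OCHRE_PROJECT, "Unit: centimeter"),
    Item("u2", "project-a", "Pottery sherd"),
    Item("u3", "project-b", "Coin"),
]

print([item.name for item in visible_items(items, "project-a")])
```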
Grouping Hierarchy items (see below) are used to organize Project items in named groups and sub-groups that indicate associations among separate projects. Parthood Hierarchy items (see below) are used to organize Project items into nested recursive hierarchies to represent sub-projects that are constituent parts of larger projects and not merely associated with them.
An Agent item represents a social agent of any kind, as defined by the project that created the item. An Agent item might represent an individual person or a group of people which may be real or fictitious and may be contemporary or historical. Every item in the database is attributed in some way to a person or persons represented by an Agent item, even if the attribution is made only implicitly to the project that added the items to the database.
The members of a project are represented individually by Agent items, which are linked to the database items they create when they enter data into the database. Agent items may also represent persons outside the project who are responsible for the project’s data in some way as authors, editors, observers, photographers, illustrators, resource creators, or data-entry staff. In this way, all the information in the database is normally attributed to one or more named agents.
Grouping Hierarchy items are used to organize Agent items in named groups and sub-groups.
A Spatial item represents a spatially situated unit of observation (or an imaginary spatial unit) of any size or kind, as defined by the project that created the item. Parthood Hierarchy items (see below) are used to organize Spatial items into nested recursive hierarchies that represent relations of strict spatial containment. For example, an archaeology or art history project might create Spatial items and organize them via a Parthood Hierarchy item to represent geographical regions containing settlements containing buildings containing artifacts containing components of those artifacts.
A Temporal item represents a temporal unit of any duration, be it a geological era or a cultural period or a sudden event, as defined by the project that created the item. Parthood Hierarchy items are used to organize Temporal items into nested recursive hierarchies that represent relations of temporal sequencing and sub-sequencing, i.e., sub-periods may be nested within longer periods of time. For example, a project in history might create Temporal items and organize them via a Parthood Hierarchy item to represent cultural ages containing historical eras containing political periods (e.g., royal dynasties) containing the reigns of particular rulers.
In some cases, it will be necessary to use a Sequence item to organize Temporal items rather than a Parthood Hierarchy item. This will be the case when the temporal sequence or process being represented entails not just branching of separate sub-sequences but the re-convergence of sub-sequences into the main sequence, as in a Harris Matrix diagram of stratigraphic relationships in archaeology.
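The difference between a branching sequence and one that re-converges can be checked mechanically: a tree allows each unit at most one immediate predecessor, whereas re-convergence gives some unit two or more. The sketch below models "earlier than" relations as pairs and uses Python's standard graphlib module to order them. The sample data and function names are hypothetical, and this is not a description of how a Sequence item is implemented.

```python
from collections import defaultdict
from graphlib import TopologicalSorter


def predecessors(edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Map each unit to the set of units that immediately precede it."""
    result = defaultdict(set)
    for earlier, later in edges:
        result[later].add(earlier)
    return result


def fits_in_tree(edges: list[tuple[str, str]]) -> bool:
    """True if no unit has more than one immediate predecessor."""
    return all(len(before) <= 1 for before in predecessors(edges).values())


# Branching only: A is followed by two separate sub-sequences, B and C.
branching = [("A", "B"), ("A", "C")]
# Re-convergence: B and C both precede D, as in a Harris Matrix diagram.
reconverging = branching + [("B", "D"), ("C", "D")]

print(fits_in_tree(branching))
print(fits_in_tree(reconverging))

# A valid earliest-to-latest ordering still exists for the re-converging case.
order = list(TopologicalSorter(predecessors(reconverging)).static_order())
print(order)
```

The relative order of B and C in the printed ordering is not fixed, since neither precedes the other.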
An Epigraphic item represents an epigraphic unit of any size, i.e., some part of the physical expression of a particular text. For example, an Epigraphic item could represent (in the case of codices) an entire book, a leaf in a book, a page surface on a leaf (recto or verso), a column (or perhaps a row-column table) on a page surface, a line within a column or table cell, a character within a line, or perhaps even a smaller grapheme within a character.
This hierarchy of possible epigraphic units is just an example; it is commonly used but OCHRE does not prescribe the way the text will be divided into epigraphic components. That is determined by the project that created the Epigraphic items or by the person who analyzed the text in accordance with the degree of atomization they need to do their research. In some cases, not just characters but diacritical marks will be represented by individual Epigraphic items. In other cases, it may be sufficient to have Epigraphic items that represent entire lines or pages and not subdivide the text any further.
Parthood Hierarchy items are used to organize Epigraphic items into nested recursive hierarchies that represent part-whole relations within the epigraphic dimension of the text, e.g., to show that the text consists of a book that contains leaves that contain page surfaces that contain columns that contain lines that contain characters that may contain smaller graphemes. Note that an Epigraphic item represents an actual region of inscription in a particular text. It does not represent an ideal sign or character in a writing system, which would be represented instead by a Sign item. Epigraphic items may contain link(s) to the relevant Sign item(s) instantiated in the text or even to a particular allograph of a sign, which is useful in some cases, but this is not required.
A Discourse item represents a discourse unit of any size, i.e., some part of the linguistic meaning of a particular text, large or small. For example, a Discourse item could represent the entire composition, a section or chapter (considered as a unit of discourse) within the composition, a paragraph within a chapter, a sentence within a paragraph, a clause within a sentence, a phrase within a clause, a word within a phrase, or a morpheme within a word.
This hierarchy of possible discourse units is just an example; these are commonly used units but OCHRE does not prescribe the way a text will be divided into discursive or grammatical components. That is determined by the project that created the Discourse items or by the person who interpreted the text, in accordance with the degree of atomization they need to do their research.
Parthood Hierarchy items are used to organize Discourse items into nested recursive hierarchies that represent part-whole relations within the discursive dimension of the text, e.g., to show that the text consists of chapters that contain paragraphs that contain sentences that contain clauses that contain phrases that contain words that contain morphemes. A Discourse item will normally contain links to the Epigraphic item(s) that have been read as expressing the discursive meaning represented by the discourse unit.
A Sign item represents a graphic sign or character in a writing system of any kind, e.g., alphabetic, syllabic, logosyllabic, or logographic. A Sign item usually contains a Unicode codepoint so it can be displayed using a Unicode font, but if Unicode does not contain the relevant sign then an image of the sign or the conventional Roman-alphabet transcription of its name can be included within the Sign item. A Sign item may contain information about the different reading values of a sign and its allographic variants (e.g., upper-case “A” and lower-case “a” are allographs of the same sign in the alphabetic writing system used in this website, and this sign also has different phonetic reading values in different contexts).
Grouping Hierarchy items are used to organize Sign items in named groups and sub-groups to represent a writing system. The hierarchical grouping of a Sign item within another Sign item is used to represent a compound sign.
A Text item normally contains a link to a Parthood Hierarchy item that represents a hierarchy of Epigraphic items plus a link to another Parthood Hierarchy item that represents a hierarchy of Discourse items. In this way, the Text item represents a particular text in both its epigraphic and discursive dimensions, which are too often conflated in digital humanities.
Note, however, there is no requirement that a Text item be linked to both kinds of hierarchy. In some cases, only the epigraphic dimension of the text is represented (e.g., in the case of an undeciphered text) or only the discursive dimension is represented (for research that can ignore the physical expression of the text). Likewise, there is no requirement that the Text item be linked to only one epigraphic hierarchy and only one discourse hierarchy. Multiple analyses of the same text on both the epigraphic and discursive levels can be represented by linking to any number of Parthood Hierarchy items representing multiple epigraphic hierarchies and multiple discourse hierarchies. Branching within the same hierarchy can also be used to represent different readings of portions of the text.
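Keeping the epigraphic and discursive dimensions in separate hierarchies pays off when their boundaries do not coincide, for instance when a word is split across two lines. The following sketch illustrates this with two small trees joined by cross-hierarchy links. The class names, field names, keys, and sample text are all hypothetical and are not OCHRE's document structures.

```python
from dataclasses import dataclass, field


@dataclass
class Epigraphic:
    key: str
    label: str
    parts: list["Epigraphic"] = field(default_factory=list)


@dataclass
class Discourse:
    label: str
    parts: list["Discourse"] = field(default_factory=list)
    epigraphic_keys: list[str] = field(default_factory=list)  # cross-hierarchy links


# Epigraphic hierarchy: a tablet with two lines; the word "data" is broken
# across the line boundary as "da" / "ta".
line1 = Epigraphic("L1", "line 1", [Epigraphic("c1", "d"), Epigraphic("c2", "a")])
line2 = Epigraphic("L2", "line 2", [Epigraphic("c3", "t"), Epigraphic("c4", "a")])
tablet = Epigraphic("T", "tablet", [line1, line2])

# Discourse hierarchy: a sentence containing one word linked to four characters.
word = Discourse("word", epigraphic_keys=["c1", "c2", "c3", "c4"])
sentence = Discourse("sentence", [word])


def index_by_key(unit: Epigraphic) -> dict[str, Epigraphic]:
    """Recursively index every epigraphic unit by its key."""
    index = {unit.key: unit}
    for part in unit.parts:
        index.update(index_by_key(part))
    return index


def reading(unit: Discourse, index: dict[str, Epigraphic]) -> str:
    """Assemble the characters that a discourse unit has been read from."""
    return "".join(index[key].label for key in unit.epigraphic_keys)


def lines_spanned(unit: Discourse, root: Epigraphic) -> list[str]:
    """Labels of the lines containing any character linked to the unit."""
    keys = set(unit.epigraphic_keys)
    return [line.label for line in root.parts
            if any(char.key in keys for char in line.parts)]


index = index_by_key(tablet)
print(reading(word, index))
print(lines_spanned(word, tablet))
```

Neither hierarchy has to be distorted to fit the other: the line break exists only in the epigraphic tree and the word exists only in the discourse tree.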
Text items that use Parthood Hierarchy items to organize Epigraphic items and Discourse items can represent texts in any genre and language and written using any writing system. They can even represent born-digital texts. However, Epigraphic, Discourse, and Text items are normally used only to represent texts that are objects of study and analysis in their own right in the context of historical or literary research. Other kinds of digital texts (e.g., scholarly reports and secondary literature) are normally represented by Resource items (see below). But the decision concerning how to represent a given text is up to the project.
Grouping Hierarchy items are used to organize Text items in named groups and sub-groups.
A Lexical item represents a word (lemma) in a dictionary or glossary for a particular language. A Lexical item may contain just one meaning (definition), several meanings, or a hierarchy of meanings and sub-meanings for each grammatical form of a word, with the option of including textual citations of the use of the grammatical form in context. Orthographic variants of each grammatical form of the word may also be included. Depending on how much detail is included, a Lexical item may contain only a brief glossary entry or a full OED-style dictionary entry.
Lexical items contain links to Discourse items that instantiate the word in its particular grammatical forms and orthographic renditions within particular texts. Grouping Hierarchy items are used to organize Lexical items in named groups and sub-groups to represent a dictionary or glossary for a particular language or dialect.
A Bibliographic item represents a bibliographic reference to a published work. OCHRE can link Bibliographic items to the Zotero online citation system to automatically populate the content of the bibliographic reference and style it according to the user’s preference. Grouping Hierarchy items are used to organize Bibliographic items in named groups and sub-groups to represent a citation list or bibliography.
A Resource item represents an external digital resource of any kind that resides outside the OCHRE database and is fetched dynamically as needed from an FTP server or HTTP Web server; for example, a 2D image, 3D model, document, spreadsheet, geospatial shapefile, audio file, video clip, etc. A Resource item contains the name, description, file format, and URL of the external digital resource. Like any other OCHRE database item, a Resource item can be linked to Attribute items which are linked in turn to Value items (or themselves contain values) to represent the properties of the digital resource, such as its metadata.
Grouping Hierarchy items are used to organize Resource items in named groups and sub-groups.
A Concept item represents a project-defined concept that does not correspond to any of the built-in OCHRE classes or subclasses. For example, units of measurement are represented as Concept items, which can be linked to Attribute items to indicate the units of a numeric attribute. Likewise, artifact styles can be represented as Concept items, or any concept a project needs to define and relate to other database items.
Concept items that will be shared by multiple projects (e.g., units of measurement and other standardized concepts) can be created within the overarching “OCHRE” project, whose database items are available to all other projects. The OCHRE support team will work with project teams to determine which of their Concept items should be shared with other projects and to help them use existing Concept items. However, individual projects are not required to use shared Concept items and are free to create their own.
Class Hierarchy items are used to organize Concept items into nested recursive hierarchies that represent class-subclass relations.
Grouping Hierarchy items are used to organize Concept items in named groups and sub-groups.
An Attribute item represents a taxonomic attribute or variable (qualitative, quantitative, or relational) that has been defined by the project or borrowed from another project. Any database item in any category may contain a link to one or more Attribute items that indicate its properties. Each Attribute item will either contain a link to a Value item (for qualitative nominal or ordinal attributes) or else the Attribute item will itself contain the value (for quantitative and relational attributes). The term “property” is used in OCHRE to refer to the attribute-plus-value, not just the attribute alone.
Linking database items in this way results in an item-attribute-value triple that makes a “statement” about an entity, analogous to the subject-predicate-object triples in RDF. Each such statement in the OCHRE database can be credited to a named author or observer by means of a link to an Agent item.
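An item-attribute-value statement with an optional attribution can be modeled as a small record. This sketch is purely illustrative: the Statement type, the helper functions, and the sample names are assumptions, and in the database itself each position would hold a link to another item rather than a literal string.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    """A hypothetical item-attribute-value triple with an optional agent."""
    item: str
    attribute: str
    value: str
    agent: Optional[str] = None  # who made the statement, if recorded


statements = [
    Statement("Pottery sherd", "Material", "Ceramic", agent="Observer 1"),
    Statement("Pottery sherd", "Size", "small", agent="Observer 2"),
    Statement("Coin", "Material", "Bronze"),
]


def properties_of(item: str, statements: list[Statement]) -> dict[str, str]:
    """Collect the attribute-plus-value properties asserted about an item."""
    return {s.attribute: s.value for s in statements if s.item == item}


def credited_to(agent: str, statements: list[Statement]) -> list[Statement]:
    """All statements attributed to a given agent."""
    return [s for s in statements if s.agent == agent]


print(properties_of("Pottery sherd", statements))
print(credited_to("Observer 2", statements))
```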
Grouping Hierarchy items are used to organize Attribute items in named groups and sub-groups. In addition, there is a Taxonomic Hierarchy item for each project that organizes its Attribute items and Value items in a taxonomic hierarchy (described in more detail below).
An Attribute item may contain internal links to one or more other Attribute items with an indication of the semantic relation between them: “close match” (synonym), “broader term,” “narrower term,” or “related term.” This permits a thesaurus-style view of a project’s terminology to be generated from a set of Attribute items. The set itself will be represented by a Set item.
There are several subclasses of Attribute items:
A Nominal Attribute item contains the name and description of a nominal attribute whose possible values have no inherent order. A nominal property is attributed to a database item by linking that item to an Attribute item which has an internal XML element that indicates it belongs to the Nominal subclass. The Attribute item in turn contains a link to a Value item.
An Ordinal Attribute item contains the name and description of an ordinal attribute whose possible values have a rank order (e.g., the values “small,” “medium,” and “large” for an ordinal attribute called “Size”). An ordinal property is attributed to a database item by linking that item to an Attribute item which has an internal XML element that indicates it belongs to the Ordinal subclass. The Attribute item in turn contains a link to a Value item.
A Boolean Attribute item contains the name and description of a logical attribute that takes a Boolean value, i.e., true or false. A Boolean property is attributed to a database item by linking that item to an Attribute item which has an internal XML element that indicates it belongs to the Boolean subclass. The value of the attribute is stored internally in the Attribute item in an XML element of the xsd:boolean data type.
An Integer Attribute item contains the name and description of a numeric attribute that takes an integer value. An integer property is attributed to a database item by linking that item to an Attribute item which has an internal XML element that indicates it belongs to the Integer subclass. The value of the attribute is stored internally in the Attribute item in an XML element of the xsd:integer data type.
A Decimal Attribute item contains the name and description of a numeric attribute that takes a decimal value. A decimal property is attributed to a database item by linking that item to an Attribute item which has an internal XML element that indicates it belongs to the Decimal subclass. The value of the attribute is stored internally in the Attribute item in an XML element of the xsd:decimal data type.
A Date Attribute item contains the name and description of an attribute that takes a calendar date as its value. A date property is attributed to a database item by linking that item to an Attribute item which has an internal XML element that indicates it belongs to the Date subclass. The value of the attribute is stored internally in the Attribute item in an XML element of the xsd:date data type.
A Coordinates Attribute item contains the name and description of an attribute that takes map coordinates as its value. A coordinates property is attributed to a database item by linking that item to an Attribute item which has an internal XML element that indicates it belongs to the Coordinates subclass. The value of the attribute is stored internally in the Attribute item in XML elements that store unprojected geographical coordinates or planar coordinates with information about the associated map projection (e.g., UTM coordinates).
An Alphanumeric Attribute item contains the name and description of an attribute that takes an alphanumeric string as its value. An alphanumeric property is attributed to a database item by linking that item to an Attribute item which has an internal XML element that indicates it belongs to the Alphanumeric subclass. The value of the attribute is stored internally in the Attribute item in an XML element of the xsd:string data type. Alphanumeric Attribute items are similar to Nominal Attribute items, but Alphanumeric Attribute items are not linked to Value items and are not part of the project’s taxonomy.
A Serial Number Attribute item contains the name and description of an attribute that takes an integer serial number as its value. A serial number property is attributed to a database item by linking that item to an Attribute item which has an internal XML element that indicates it belongs to the Serial Number subclass. The value of the attribute is stored internally in the Attribute item in an XML element of the xsd:integer data type. Serial Number Attribute items are similar to Integer Attribute items but they behave differently because the software automatically increments the serial number each time the Attribute item is used.
A Relational Attribute item contains the name and description of an attribute that takes as its value the UUID database key of another item in the database. A relational property is attributed to a database item by linking that item to an Attribute item which has an internal XML element that indicates it belongs to the Relational subclass. The value of the attribute is the UUID database key of the related item, which is stored internally in an XML element in the Attribute item.
Relational Attribute items are used for project-defined relations between items, spanning hierarchies and item classes and supplementing the inter-item relations created by means of Tree items (see below). In other words, a named relation between any two items in the database can be created using a Relational Attribute item. For example, a Spatial item might contain a link to a Relational Attribute item named “is above” that contains as its attribute value the UUID database key of another Spatial item to represent the fact that one thing is situated spatially above another.
Database items that have been linked together by Relational Attribute items constitute a network-graph structure. The OCHRE user interface can use these inter-item relations to display visualizations of network graphs using node-link diagrams and can analyze the networks using standard graph-analysis algorithms. This is valuable for social network analysis, for example, to identify clusters and cliques in a social network of Agent items.
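As a minimal sketch of this idea, relational properties can be treated as labeled edges between item keys and grouped into clusters with an ordinary graph traversal. The item keys, the relation name, and the function names below are hypothetical, and the sketch works on in-memory Python data rather than on OCHRE's XML documents.

```python
from collections import defaultdict

# Hypothetical relational properties: (source key, relation name, target key).
# Real OCHRE keys are UUIDs; short labels are used here for readability.
relations = [
    ("agent-a", "corresponds with", "agent-b"),
    ("agent-b", "corresponds with", "agent-c"),
    ("agent-a", "corresponds with", "agent-c"),
    ("agent-d", "corresponds with", "agent-e"),
]

def build_graph(triples):
    """Build an undirected adjacency map from relational triples."""
    graph = defaultdict(set)
    for source, _name, target in triples:
        graph[source].add(target)
        graph[target].add(source)
    return graph

def connected_components(graph):
    """Return the clusters (connected components) as sorted lists of keys."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

graph = build_graph(relations)
clusters = connected_components(graph)
```

Here the five agents fall into two clusters, one of three mutually linked agents and one of two. A dedicated graph library would be the natural choice for larger networks or for clique detection.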
The combination of the hierarchies of database items organized by Hierarchy items and the cross-hierarchy links created by Relational Attribute items — not to mention the internal XML hierarchies often found within database items in other categories — yields a semistructured graph database that can represent any kind of information.
A Value item represents a qualitative nominal or ordinal value that has been defined by the project or borrowed from another project. A qualitative property is attributed to a database item by linking that item to a Nominal Attribute item or an Ordinal Attribute item (see above) that in turn contains a link to a Value item, which stores the qualitative value internally as a character string.
Many different item properties can therefore use the same value by pointing to the same Value item. This avoids error-prone and storage-consuming duplication of data and maintains the regularity and “cleanliness” of the data because the name of the value exists in only one place in the database and a change in the value’s name will be propagated instantly wherever it is displayed. This also permits the organization of qualitative values as taxonomic terms independently of the properties in which they are used.
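The effect of storing a value name in only one place can be sketched as follows. The dictionaries and keys are hypothetical stand-ins for Value items and properties, not OCHRE's actual XML representation.

```python
# Hypothetical stand-in for Value items, keyed by identifier.
values = {"val-1": {"name": "limestone"}}

# Properties of two different items point to the same Value item by key.
properties = [
    {"item": "item-1", "attribute": "material", "value_ref": "val-1"},
    {"item": "item-2", "attribute": "material", "value_ref": "val-1"},
]

def display(prop):
    """Resolve the value reference at display time."""
    name = values[prop["value_ref"]]["name"]
    return f'{prop["item"]}: {prop["attribute"]} = {name}'

before = [display(p) for p in properties]

# Renaming the value once changes every property that refers to it.
values["val-1"]["name"] = "chalky limestone"
after = [display(p) for p in properties]
```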
Unlike qualitative values, quantitative integer and decimal values do not need to be defined in OCHRE because the ontological class of numbers is already predefined for everyone and is digitally represented by standard data types (i.e., the XML Schema (XSD) data types) that are interpreted in the same way everywhere on the Web. The same is true of Boolean values, calendar dates, and map coordinates.
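Because these data types are standardized, any programming language can interpret the stored strings in the same way. The sketch below maps a few XSD type names to Python parsers. The `PARSERS` table and `parse_value` function are illustrative assumptions, and only the simplest lexical forms are handled.

```python
from datetime import date
from decimal import Decimal

def parse_boolean(text):
    """Parse the lexical forms that XSD allows for xsd:boolean."""
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"not an xsd:boolean: {text!r}")

# Hypothetical lookup table from XSD type name to a Python parser.
PARSERS = {
    "xsd:integer": int,
    "xsd:decimal": Decimal,
    "xsd:boolean": parse_boolean,
    "xsd:date": date.fromisoformat,  # XSD's optional time zone suffix is not handled
    "xsd:string": str,
}

def parse_value(xsd_type, text):
    """Convert the stored string into a typed Python value."""
    return PARSERS[xsd_type](text)
```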
A Value item may contain internal links to one or more other Value items with an indication of the semantic relation between them: “close match” (synonym), “broader term,” “narrower term,” or “related term.” This permits a thesaurus-style view of a project’s terminology to be generated from a set of Value items. The set itself will be represented by a Set item.
A Query item represents the search criteria for a database query that can be executed to select database items. Search criteria can be named and saved in a Query item for repeated use. The criteria can be quite complex, involving both the intrinsic attribute-value properties of database items and their extrinsic relations with other items. The extrinsic relations among database items can be specified by Tree items (see below) or by Relational Attribute items.
Boolean algebraic operators (AND, OR, NOT) and relational operators (<, >, <=, >=, ==, !=) are supported in query expressions, as well as the nesting of expressions via parentheses. Readable query expressions are constructed in the back-end user interface of the database via drop-down pick lists. The user can easily specify the scope of the query by selecting the projects and item categories to include and then specify the attributes, attribute value ranges, and operators of the search criteria.
When the query is executed, the user-created query criteria are automatically converted to XQuery and sent to the native-XML database management system for execution, returning a list of item identifiers (UUID database keys) as the query result set. Separate queries can be chained sequentially to select items based on the intersection or union of their result sets.
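The logic of nested criteria and chained result sets can be sketched in plain Python. This sketch does not generate XQuery and does not reflect OCHRE's internal query format. The tuple-based expression syntax, the item keys, and the attribute names are assumptions made for illustration.

```python
import operator

# The relational operators named in the text, mapped to Python functions.
OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def matches(properties, expr):
    """Evaluate a nested criteria expression against one item's properties."""
    kind = expr[0]
    if kind == "AND":
        return all(matches(properties, sub) for sub in expr[1:])
    if kind == "OR":
        return any(matches(properties, sub) for sub in expr[1:])
    if kind == "NOT":
        return not matches(properties, expr[1])
    # Leaf expression: (operator, attribute name, comparison value)
    op, attribute, value = expr
    if attribute not in properties:
        return False
    return OPS[op](properties[attribute], value)

def run_query(items, expr):
    """Return the set of item keys whose properties satisfy the expression."""
    return {key for key, props in items.items() if matches(props, expr)}

items = {
    "item-1": {"material": "bronze", "length_cm": 12.5},
    "item-2": {"material": "iron", "length_cm": 30.0},
    "item-3": {"material": "bronze", "length_cm": 41.0},
}

# (material == bronze OR material == iron) AND NOT (length_cm > 40)
criteria = ("AND",
            ("OR", ("==", "material", "bronze"), ("==", "material", "iron")),
            ("NOT", (">", "length_cm", 40)))

first = run_query(items, criteria)
second = run_query(items, ("==", "material", "bronze"))
both = first & second    # intersection of two result sets
either = first | second  # union of two result sets
```

Each result set is a set of item keys, so chaining queries reduces to ordinary set intersection and union.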
Set items are used to store query result sets, i.e., the set of database items found by a query. Grouping Hierarchy items are used to organize Query items in named groups and sub-groups.
A Set item organizes other database items in a set (i.e., a list) that may be ordered or unordered and may contain different classes of items. Database items organized by a Set item can be displayed in the user interface as a list of item names that users click to call up the individual item descriptions or they can be displayed as a table with rows and columns. In such a table, each item in the set is shown as a row and the properties of the items (their attribute-value pairs) are shown as columns, i.e., the attributes are the column headings and the values are shown in the table cells. Database items organized by a Set item can also be displayed in the user interface as network graphs using node-link diagrams if they have Relational Attributes that link one item to another.
Executing a query in OCHRE yields a set of database items that is organized by a Set item. The query results can thus be saved and displayed as a table or network graph for further analysis.
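The tabular view described above amounts to a simple pivot of attribute-value pairs, sketched here under the assumption that each item's properties are available as a dictionary. The item names, attribute names, and the `to_table` function are hypothetical.

```python
def to_table(items):
    """Return (header, rows): one row per item, one column per attribute."""
    columns = []
    for props in items.values():
        for attribute in props:
            if attribute not in columns:
                columns.append(attribute)
    header = ["item"] + columns
    # Items that lack an attribute get an empty cell in that column.
    rows = [[name] + [props.get(col, "") for col in columns]
            for name, props in items.items()]
    return header, rows

result_set = {
    "Lamp 1": {"material": "clay", "height_cm": 4.2},
    "Jar 7": {"material": "clay", "rim": "everted"},
}
header, rows = to_table(result_set)
```

Because items in a set may belong to different classes and carry different attributes, the column list is the union of all attributes found, and missing values are left blank.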
Set items also come into play when a project director decides to publish data from the back end of the OCHRE platform to the front end. When this happens, database items are transformed into a set of “denormalized” (structurally simplified) XML documents or equivalent JSON documents suitable for use by Web app developers via the OCHRE Web API. A Set item is used to organize each set of items that have been published and stores metadata about the publication.
Unlike the small XML documents that constitute highly atomized database items inside the database, each published document normally corresponds to a real-world entity such as an artifact or a text, or some other entity or topic that app developers will prefer to handle as a pre-packaged unit of information. Even though a published XML or JSON document is not a database item that conforms to one of the XML document types in the back-end database, it contains persistent URLs to individual items of information represented as elements within the document. These URLs contain the UUID keys of the database items to which they correspond. Thus, Set items stored in the back-end database can use these keys to keep track of what has been published to the front end of the platform.
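One way to recover the keys from a published document is to scan it for UUID-shaped strings, as in the sketch below. The URL pattern, the XML element names, and the sample UUIDs are invented for illustration and do not represent the actual OCHRE publication format.

```python
import re
import uuid

# Matches the standard 8-4-4-4-12 hexadecimal layout of a UUID.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

def keys_in_document(text):
    """Return the normalized UUID keys found anywhere in a published document."""
    return {str(uuid.UUID(match)) for match in UUID_PATTERN.findall(text)}

# Hypothetical published document with persistent URLs that embed UUID keys.
published = """
<artifact url="https://data.example.org/item/0A1B2C3D-1111-4222-8333-444455556666">
  <photo url="https://data.example.org/item/9f8e7d6c-aaaa-4bbb-8ccc-ddddeeeeffff"/>
</artifact>
"""
tracked = keys_in_document(published)
```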
A Sequence item organizes other database items into sequential structures that may branch and re-converge. This is useful for representing a timeline of temporal events or a flowchart of processes in which a sub-sequence may diverge and then re-converge with the main sequence.
A Sequence item has an internal tree graph structure to represent the branching of sub-sequences from the main sequence but it is not strictly a tree because it allows branching paths to re-converge, forming a directed acyclic graph (DAG). This kind of graph structure is widely used to represent causal relations. In archaeology, it is used to model stratigraphic sequences, as in the Harris Matrix diagramming method.
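A branching and re-converging sequence of this kind can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The unit labels are hypothetical, and the mapping of each unit to its predecessors is an assumed representation, not the internal structure of a Sequence item.

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps to the set of units that must come before it.
# "fill" and "wall" branch from "bedrock" and re-converge at "floor".
predecessors = {
    "fill": {"bedrock"},
    "wall": {"bedrock"},
    "floor": {"fill", "wall"},
    "topsoil": {"floor"},
}

def ordered(preds):
    """Return one valid ordering in which every unit follows its predecessors."""
    return list(TopologicalSorter(preds).static_order())

def is_acyclic(preds):
    """Report whether the relations form a valid DAG (no unit precedes itself)."""
    try:
        ordered(preds)
        return True
    except CycleError:
        return False

order = ordered(predecessors)
```

The relative order of the two branches is not fixed by the graph, so any ordering that places "bedrock" first and "floor" after both branches is valid.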
Grouping Hierarchy items are used to organize Sequence items in named groups and sub-groups.
A Hierarchy item has an internal tree structure (understood mathematically as a type of graph) that is used to organize other database items in hierarchies and lists. Hierarchy items exist separately from the many other atomized database items they organize. Different Hierarchy items may be linked to the same database item, which is thereby included in different hierarchies reflecting different interpretations of the data. Alternatively, a single Hierarchy item may contain multiple links to the same database item from different locations in its hierarchy, which allows the branches of a single hierarchy to represent different interpretations of the same items. By allowing an item to occur in more than one hierarchy or in more than one branch of the same hierarchy, the OCHRE database can represent multiple overlapping configurations of entities that reflect multiple interpretations without duplicating any of the database items.
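The separation between hierarchies and the items they organize can be sketched as follows. Items are stored once, and each hierarchy holds only links (keys) to them, so the same item can appear in two interpretations without being copied. All keys, names, and the nested-tuple representation are assumptions for illustration.

```python
# Each database item is stored once, keyed by a hypothetical identifier.
items = {
    "site": "Tell Example",
    "bldg-a": "Building A",
    "court": "Courtyard",
    "room-5": "Room 5",
}

# Two hierarchies as (item key, [child nodes]); both link to "room-5".
interpretation_1 = ("site", [("bldg-a", [("room-5", [])]), ("court", [])])
interpretation_2 = ("site", [("bldg-a", []), ("court", [("room-5", [])])])

def paths_to(node, target, trail=()):
    """Recursively collect every path from the root to the target item key."""
    key, children = node
    trail = trail + (key,)
    found = [trail] if key == target else []
    for child in children:
        found.extend(paths_to(child, target, trail))
    return found

def named(path):
    """Render a path of keys using the names of the underlying items."""
    return " > ".join(items[key] for key in path)

context_1 = [named(p) for p in paths_to(interpretation_1, "room-5")]
context_2 = [named(p) for p in paths_to(interpretation_2, "room-5")]
```

The recursive traversal in `paths_to` also handles a single hierarchy that links to the same item from several branches, in which case it returns more than one path.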
The metaphor of a hierarchical “tree of knowledge” has a long history. It underlies the Porphyrian Tree attributed to the third-century CE Neoplatonist philosopher Porphyry of Tyre. In his introduction to Aristotle’s Categories, Porphyry discussed the classification of the categories in a way that later gave rise to tree-like diagrams consisting of branching divisions in which high-level categories are successively differentiated. The genus-species binomial nomenclature of modern biology introduced by Carl Linnaeus is intellectually indebted to the medieval Porphyrian Tree.
Hierarchical tree structures have become such a common way of conceiving of entities in the world that it is useful to use Hierarchy as a basic class in OCHRE without claiming that hierarchies are the only way, or always the best way, to organize information. In theory, we could have added another basic class of Network items, in addition to Hierarchy items, using a different metaphor. But a non-hierarchical web-like network of items (including a network of taxonomic terms) can be easily represented by a Set item that lists database items which themselves contain links to Relational Attribute items whose values are the unique identifiers (database keys) of other items, forming a network of labeled relations among database items.
The alternative would be to add two additional basic classes to the ontology: a class of Network items and a class of Relation items. A Network item would organize a set of Relation items, each of which would contain links to two database items that are being related to one another and would also store the name, description, directionality, and weight of a relation between those two items. However, a relation can be regarded as a kind of property (though this is debated among philosophers). A relation between two items in a network is represented in OCHRE as a property of the source item that links it to a target item by means of a Relational Attribute item whose value is the unique identifier of the target item (a property is an attribute-value pair). In addition, many of the relations between items are represented by Sequence items and Hierarchy items.
Hierarchy items are used to represent part-whole relations (parthood), class-subclass relations, grouping and sub-grouping relations, and taxonomic hierarchies. There are thus four subclasses of Hierarchy items, listed below, which share the same XML document type but contain an element that identifies the subclass to which a given item belongs, triggering different behaviors in the user interface.
A Parthood Hierarchy item organizes database items in a recursive nested hierarchy or tree in which each item is the child of a parent item that belongs to the same class, e.g., Spatial items can be recursively nested within Spatial items, Temporal items within Temporal items, Epigraphic items within Epigraphic items, and Discourse items within Discourse items. A parent item may have one or more child items but each child item has one and only one parent; thus there is a single parent item at the root of the tree.
The meaning of the hierarchical relations represented in a Parthood Hierarchy item depends on the class of items it organizes. In the case of Spatial, Temporal, Epigraphic, and Discourse items, the hierarchy represents part-whole relations (mereological parthood).
A Class Hierarchy item organizes Concept items in a recursive nested hierarchy that represents class-subclass relations. Note that some philosophers consider class-subclass relations to be equivalent to parthood relations (see, e.g., David K. Lewis, Parts of Classes).
A Grouping Hierarchy item organizes other database items in a non-recursive hierarchy or tree of parent items and child items that may belong to different categories. A Grouping Hierarchy item does not represent parthood relations or class-subclass relations but rather the grouping and sub-grouping of items that are associated with one another on some basis. For example, catalogues of Resource items can be represented by means of Grouping Hierarchy items.
A Grouping Hierarchy item can be used to organize other Grouping Hierarchy items in named groups and sub-groups.
A Taxonomic Hierarchy has a modified internal tree structure that organizes Attribute items and Value items in a taxonomic hierarchy in which Attribute items alternate with Value items at successive levels of the hierarchy. This allows a project to specify the allowable values for each qualitative attribute by making them children of that attribute in the tree.
It is possible for an Attribute item to be a child of a Value item that is itself a child of the same Attribute item. This recursive structure, repeating the same Attribute item at lower levels of the taxonomic hierarchy as a descendant of itself, represents the genus-species relation between more general and more specific values of an attribute, allowing queries to use a general term to find more specific terms within the taxonomic hierarchy and vice versa.
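The alternating structure and the use of a general term to find more specific terms can be sketched as follows. The nested-tuple representation, the attribute and value names, and the function names are hypothetical. For simplicity, the sketch collects every value nested beneath the general term, whichever attribute it belongs to.

```python
# Node layout: (kind, name, children). Attribute and Value nodes alternate,
# and the "material" attribute recurs beneath its own value "metal".
taxonomy = ("attribute", "material", [
    ("value", "metal", [
        ("attribute", "material", [
            ("value", "bronze", []),
            ("value", "iron", []),
        ]),
    ]),
    ("value", "stone", []),
])

def alternates(node):
    """Check that Attribute and Value nodes alternate at successive levels."""
    kind, _name, children = node
    expected = "value" if kind == "attribute" else "attribute"
    return all(child[0] == expected and alternates(child) for child in children)

def values_under(node):
    """Yield the names of all Value nodes strictly below the given node."""
    for child in node[2]:
        if child[0] == "value":
            yield child[1]
        yield from values_under(child)

def specific_terms(node, general):
    """Find the more specific values nested beneath a general value."""
    kind, name, children = node
    if kind == "value" and name == general:
        return sorted(values_under(node))
    found = []
    for child in children:
        found.extend(specific_terms(child, general))
    return found

expanded = specific_terms(taxonomy, "metal")
```

A query for "metal" could then be widened to include "bronze" and "iron" by matching any term in the expanded list.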
Each OCHRE project has one and only one Taxonomic Hierarchy item, which specifies the taxonomy used in the properties of database items owned by that project.
Linking OCHRE Database Items to External Controlled Vocabularies
Several classes of OCHRE database items can be linked semantically to external controlled vocabularies of terms and concepts such as WikiData and the Getty Vocabularies. This can be done for the following classes: Agent, Spatial, Temporal, Text, Resource, Concept, Attribute, and Value. (For descriptions of these classes, see the section on “Ontological Classes of Items in the OCHRE Database.”)
Users can enter and save SPARQL queries associated with an item in any of these eight classes. The SPARQL queries are used to search a given external vocabulary and find the URLs of published concepts that could be linked to the item. An OCHRE item can be semantically linked to one or more external terms or concepts from any number of published vocabularies. The semantic linkage may be characterized as a “close match” (synonym), “broader term,” “narrower term,” or just a “related term.” If desired, the external term can be displayed in the OCHRE user interface as the name of the item instead of using a project-defined name. This will often be appropriate in the case of a close semantic match, allowing projects to employ standard terms curated by reputable organizations in various domains of research, such as the Getty Research Institute in the domain of cultural heritage.
These external semantic linkages clarify the meaning of terms used by OCHRE projects and provide interoperability with other systems. They solve the problem of homographs (i.e., words that have the same written form but different meanings, such as “light” in weight versus “light” in color). They allow OCHRE projects to employ any language, not just English, and to translate their terms using standard terminologies.
More generally, semantic linkages to external controlled vocabularies enable cross-project querying within the OCHRE environment among projects that use different nomenclatures. If each project links its terms to one or more external controlled vocabularies, an OCHRE database query can retrieve similar items that have been described differently by different projects. Alternatively, a project can borrow a taxonomy or part of a taxonomy from another project entirely within the OCHRE database platform itself, as long as the second project has made its taxonomy public for other OCHRE projects to use. This provides another (and often more efficient) way to achieve semantic integration among projects.
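The following sketch illustrates how shared external concepts can bridge different project nomenclatures. The project names, local terms, and `vocab.example.org` URIs are placeholders, and the matching logic is a simplified assumption rather than a description of OCHRE's query engine.

```python
# Hypothetical links: (project, local term) -> (external concept URI, match type)
links = {
    ("project-a", "jar"): ("http://vocab.example.org/concept/123", "close match"),
    ("project-b", "storage vessel"): ("http://vocab.example.org/concept/123", "close match"),
    ("project-b", "lamp"): ("http://vocab.example.org/concept/456", "close match"),
}

items = [
    {"key": "item-1", "project": "project-a", "type": "jar"},
    {"key": "item-2", "project": "project-b", "type": "storage vessel"},
    {"key": "item-3", "project": "project-b", "type": "lamp"},
]

def equivalent_terms(project, term):
    """Find all project terms that closely match the same external concept."""
    uri, _match = links[(project, term)]
    return {local for local, (other_uri, match) in links.items()
            if other_uri == uri and match == "close match"}

def cross_project_query(project, term):
    """Retrieve items from any project described by an equivalent term."""
    terms = equivalent_terms(project, term)
    return [item["key"] for item in items
            if (item["project"], item["type"]) in terms]

found = cross_project_query("project-a", "jar")
```

A search for "jar" in one project thus also retrieves the "storage vessel" of another project, because both terms are linked to the same external concept.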
OCHRE Data Analysis and Visualization
The user interface for the database on the back end of the OCHRE platform provides a mechanism for scholars to build and execute powerful queries with search criteria that include both the intrinsic properties of database items and their extrinsic relations to other items (see the discussion of Query items in the section above on “Ontological Classes of Items in the OCHRE Database”). The query results can be saved as a set of database items that may then be displayed in tabular form and subjected to basic statistical methods for analyzing and visualizing the data. The data can also be visualized in geographical maps using OCHRE’s GIS features, where appropriate, or in node-link network-graph diagrams.
In addition to these built-in features for data analysis and visualization, the OCHRE database can interact with an external R server in a seamless fashion to do more advanced analysis and visualizations. R is a programming language and a set of related software tools for statistical computing and graphics. It is free and open-source, and has become very popular in recent years.
The use of R with OCHRE is currently under development. More information will be available soon.
OCHRE query results can be formatted as R data frames and sent to the R server together with R commands that will execute code on the R server to perform the desired analytical procedures. The numerical and graphical results of the analysis are then sent back from the R server to OCHRE, where the user can save them in the database as Resource items for later use.
In addition to the built-in R functions, there are many pre-written R packages available to perform a wide variety of procedures, ranging from simple univariate and bivariate statistics to complex multivariate statistics, as well as specialized kinds of data analysis, such as natural language processing (NLP), social network analysis (SNA), spatial analysis, and machine learning. R packages can make use of code libraries written in other languages such as FORTRAN, C/C++, Java, or Python. Thus, R provides a mechanism for running Python code, for example (e.g., the NumPy and SciPy libraries), if a project wishes to do so.
Users who know R can enter commands directly into a data-aware R console inside the user interface of the back-end OCHRE database. They can save the R commands they have entered for repeated use. Commands in the console allow them to submit data to the R server from external CSV and Excel (XLSX) files or from dynamic OCHRE queries. Outputs from the R server are then displayed to the user in a separate window. These outputs can be named and saved in the database as Resource items.
In addition to, or instead of, entering commands in the R console, a project can use YAML or JSON to script multi-step analytical workflow jobs that (1) perform OCHRE queries, (2) execute R functions to analyze the query results, and (3) specify the outputs to be returned from the R server (PDFs, images, etc.). These workflows can be named and saved by a project for use by people who do not know R or do not want to write their own scripts.
When a workflow script is executed, the user is prompted to specify any external files to be used in the analysis and to supply run-time arguments to pass to the parameters of the chosen queries and R functions, in order to customize them for the current job. The progress of the job is echoed in the R console window. Scripted workflow jobs can be chained, such that the output of one job is the in-memory input (data frame) for the next. Both the workflow scripts and the outputs can be named and saved in the database as Resource items for repeated use.
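The general shape of a scripted, chained workflow with run-time arguments can be sketched as follows. Everything in this sketch is an assumption: the JSON keys, the `$name` placeholder convention, and the step names are invented for illustration and are not OCHRE's workflow schema. Plain Python functions stand in for the OCHRE query and the R function, and no R server is called.

```python
import json
from statistics import mean

# Hypothetical workflow script; "$material" is filled in at run time.
WORKFLOW = json.loads("""
{
  "name": "mean-length",
  "steps": [
    {"run": "query", "args": {"material": "$material"}},
    {"run": "mean_of", "args": {"column": "length_cm"}}
  ]
}
""")

DATA = [
    {"material": "bronze", "length_cm": 10.0},
    {"material": "bronze", "length_cm": 20.0},
    {"material": "iron", "length_cm": 30.0},
]

def query(_previous, material):
    """Stand-in for a database query: select rows by material."""
    return [row for row in DATA if row["material"] == material]

def mean_of(previous, column):
    """Stand-in for an analysis function applied to the previous step's output."""
    return mean(row[column] for row in previous)

STEPS = {"query": query, "mean_of": mean_of}

def run(workflow, runtime_args):
    """Run the steps in order, passing each output to the next step."""
    result = None
    for step in workflow["steps"]:
        args = {}
        for name, value in step["args"].items():
            # Replace "$name" placeholders with user-supplied run-time arguments.
            if isinstance(value, str) and value.startswith("$"):
                value = runtime_args[value[1:]]
            args[name] = value
        result = STEPS[step["run"]](result, **args)
    return result

bronze_mean = run(WORKFLOW, {"material": "bronze"})
```

Saving the script separately from the run-time arguments is what allows the same workflow to be reused by people who supply only the arguments.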
Finally, instead of performing data analysis on the back end of the OCHRE platform, the Web apps provided by the Forum for Digital Culture can be used to analyze and visualize data published to the front end via the OCHRE Web API.
These Web apps are currently under development. More information will be available soon.
OCHRE Data Publication
The OCHRE platform is the basis for the Online Publication Service of the Forum for Digital Culture. Research projects that store their data in the OCHRE database can publish it to the Web in a permanently accessible, citable, and open-access fashion via the OCHRE Web API and a suite of Web apps provided by the Forum for viewing and analyzing the data.
More information will be available soon.