HEAL Data Dictionary¶
version 0.3.2
The aim of this HEAL metadata piece is to track and provide basic information about variables in a tabular data file (i.e. a data file with rows and columns) from your HEAL study. The objective is to list all variables and descriptive information about those variables. This will ensure that potential secondary data users know what data has been collected or calculated and how to use these data. Note that a given study can have multiple tabular data files; You should create a data dictionary for each tabular data file. Thus, a study may have multiple data dictionaries.
Highly encouraged
- Only name
and description
properties are required.
- For categorical variables, constraints.enum
and enumLabels
(where applicable) properties are highly encouraged.
- For studies using HEAL or other common data elements (CDEs), standardsMappings
information is highly encouraged.
- type
and format
properties may be particularly useful for some variable types (e.g. date-like variables)
Properties (i.e., fields or variables)¶
schemaVersion
(string) The version of the schema used in agreed upon convention of major.minor.path (e.g., 1.0.2)NOTE: This is NOT for versioning of each indiviual data dictionary instance. Rather, it is the version of THIS schema document. See
version
property (below) if specifying the individual data dictionary instance version.If generating a vlmd document as a csv file, include this version in every row/record to indicate this is a schema level property (not applicable for the json version as this property is already at the schema/root level)
Examples:
1.0.0
0.2.0
section
(string) The section, form, survey instrument, set of measures or other broad category used to group variables. Previously called "module."Examples:
Demographics
PROMIS
Medical History
name
(string,required) The name of a variable (i.e., field) as it appears in the data.Examples:
gender_id
title
(string) The human-readable title or label of the variable.Examples:
Gender identity
description
(string,required) An extended description of the variable. This could be the definition of a variable or the question text (e.g., if a survey).Examples:
The participant's age at the time of study enrollment
What is the highest grade or level of school you have completed or the highest degree you have received?
type
(string) A classification or category of a particular data element or property expected or allowed in the dataset.Must be one of:
number
,integer
,string
,any
,boolean
,date
,datetime
,time
,year
,yearmonth
,duration
,geopoint
format
(string) Indicates the format of the type specified in thetype
property. Each format is dependent on thetype
specified. See here for more information about appropriateformat
values by variabletype
.
constraints.required
(boolean) If this variable is marked as true, then this variable's value must be present (ie not missing; see missingValues). If marked as false or not present, then the variable CAN be missing.
constraints.maxLength
(integer) Indicates the maximum length of an iterable (e.g., array, string, or object). For example, if 'Hello World' is the longest value of a categorical variable, this would be a maxLength of 11.
constraints.enum
(string) Constrains possible values to a set of values.Examples:
1|2|3|4|5
Poor|Fair|Good|Very good|Excellent
constraints.pattern
(string) A regular expression pattern the data MUST conform to.
constraints.maximum
(integer) Specifies the maximum value of a field (e.g., maximum -- or most recent -- date, maximum integer etc). Note, this is different then maxLength property.
constraints.minimum
(integer) Specifies the minimum value of a field.
enumLabels
(string) Variable value encodings provide a way to further annotate any value within a any variable type, making values easier to understand.Many analytic software programs (e.g., SPSS,Stata, and SAS) use numerical encodings and some algorithms only support numerical values. Encodings (and mappings) allow categorical values to be stored as numerical values.
Additionally, as another use case, this field provides a way to store categoricals that are stored as "short" labels (such as abbreviations).
This field is intended to follow this pattern
Examples:
1=Poor|2=Fair|3=Good|4=Very good|5=Excellent
HW=Hello world|GBW=Good bye world|HM=Hi, Mike
enumOrdered
(boolean) Indicates whether a categorical variable is ordered. This variable is relevant for variables that have an ordered relationship but not necessarily a numerical relationship (e.g., Strongly disagree < Disagree < Neutral < Agree).This field is intended to follow the ordering aspect of this [this pattern]this pattern
missingValues
(string) A list of missing values specific to a variable.Examples:
Missing|Skipped|No preference
Missing
trueValues
(string) For boolean (true) variable (as defined in type field), this field allows a physical string representation to be cast as true (increasing readability of the field). It can include one or more values.Examples:
required|Yes|Checked
required
falseValues
(string) For boolean (false) variable (as defined in type field), this field allows a physical string representation to be cast as false (increasing readability of the field) that is not a standard false value. It can include one or more values.Examples:
Not required|NOT REQUIRED
No
custom
(string) Additional properties not included a core property.
standardsMappings[0].instrument.url
(string) A url (e.g., link, address) to a file or other resource containing the instrument, or a set of items which encompass a variable in this variable level metadata document (if at the root level or the document level) or the individual variable (if at the field level).Examples:
https://www.heal.nih.gov/files/CDEs/2023-05/adult-demographics-cdes.xlsx
standardsMappings[0].instrument.source
(string) An abbreviated name/acronym from a controlled vocabulary referencing the resource (e.g., program or repository) containing the instrument, or a set of items which encompass a variable in this variable level metadata document (if at the root level or the document level) or the individual variable (if at the field level).Must be one of:
heal-cde
standardsMappings[0].instrument.title
(string)Examples:
Adult demographics
adult-demographics
standardsMappings[0].instrument.id
(string) A code or other string that identifies the instrument within the source. This should always be from the source's formal, standardized identification systemExamples:
5141
standardsMappings[0].item.url
(string) The url that links out to the published, standardized mapping of a variable (e.g., common data element)Examples:
https://evs.nci.nih.gov/ftp1/CDISC/SDTM/SDTM%20Terminology.html#CL.C74457.RACE
standardsMappings[0].item.source
(string) The source of the standardized variable. Note, this property is required if an id is specified.Examples:
CDISC
standardsMappings[0].item.id
(string) The id locating the individual mapping within the given source. Note, thestandardsMappings[0].source
property is required if this property is specified.Examples:
C74457
relatedConcepts[0].url
(string) The url that links out to the published, related concept. The listed examples could both be attached to any variable related to, for example, heroin use.Examples:
https://www.ebi.ac.uk/chebi/chebiOntology.do?chebiId=CHEBI:27808
http://purl.bioontology.org/ontology/RXNORM/3304
relatedConcepts[0].title
(string) A human-readable title (ie label) to a concept related to the given field. The listed examples could both be attached to any variable related to, for example, heroin use.Examples:
Heroin Molecular Structure
Heroin Ontology
relatedConcepts[0].source
(string) The source (e.g., a dictionary or vocabulary set) to a concept related to the given field. The listed examples could both be attached to any variable related to, for example, heroin use.Examples:
CHEBI
RXNORM
relatedConcepts[0].id
(string) The id locating the individual concept within the source of the given field. The listed examples could both be attached to any variable related to, for example, heroin use.Examples:
27808
3304
End of schema - Additional Property information¶
type
enum definitions:
number
(A numeric value with optional decimal places. (e.g., 3.14))integer
(A whole number without decimal places. (e.g., 42))string
(A sequence of characters. (e.g., \"test\"))any
(Any type of data is allowed. (e.g., true))boolean
(A binary value representing true or false. (e.g., true))date
(A specific calendar date. (e.g., \"2023-05-25\"))datetime
(A specific date and time, including timezone information. (e.g., \"2023-05-25T10:30:00Z\"))time
(A specific time of day. (e.g., \"10:30:00\"))year
(A specific year. (e.g., 2023)yearmonth
(A specific year and month. (e.g., \"2023-05\"))duration
(A length of time. (e.g., \"PT1H\")geopoint
(A pair of latitude and longitude coordinates. (e.g., [51.5074, -0.1278]))
format
examples/definitions of patterns and possible values:
Examples of date time pattern formats
%Y-%m-%d
(for date, e.g., 2023-05-25)%Y%-%d
(for date, e.g., 20230525) for date without dashes%Y-%m-%dT%H:%M:%S
(for datetime, e.g., 2023-05-25T10:30:45)%Y-%m-%dT%H:%M:%SZ
(for datetime with UTC timezone, e.g., 2023-05-25T10:30:45Z)%Y-%m-%dT%H:%M:%S%z
(for datetime with timezone offset, e.g., 2023-05-25T10:30:45+0300)%Y-%m-%dT%H:%M
(for datetime without seconds, e.g., 2023-05-25T10:30)%Y-%m-%dT%H
(for datetime without minutes and seconds, e.g., 2023-05-25T10)%H:%M:%S
(for time, e.g., 10:30:45)%H:%M:%SZ
(for time with UTC timezone, e.g., 10:30:45Z)%H:%M:%S%z
(for time with timezone offset, e.g., 10:30:45+0300)
Examples of string formats
email
if valid emails (e.g., test@gmail.com)uri
if valid uri addresses (e.g., https://example.com/resource123)binary
if a base64 binary encoded string (e.g., authentication token like aGVsbG8gd29ybGQ=)uuid
if a universal unique identifier also known as a guid (eg., f47ac10b-58cc-4372-a567-0e02b2c3d479)
Examples of geopoint formats
The two types of formats for geopoint
(describing a geographic point).
array
(if 'lat,long' (e.g., 36.63,-90.20))object
(if {'lat':36.63,'lon':-90.20})
standardsMappings
andrelatedConcepts
: If you want to add more than one value,adding anoth column with a name containing an added digit in brackets ([0]
-->[1]
-->[n]
).
Examples:
A table with 2 columns (fields) of the same variables:
standardsMappings[0].instrument.title |
standardsMappings[1].instrument.title |
---|---|
My first instrument | My second instrument |
A table with 3 columns (fields) of the same variables:
relatedConcepts[0].url |
relatedConcepts[1].url |
relatedConcepts[2].url |
---|---|---|
fakehttp://my-first-concept-url.org | fakehttp://my-second-concept-url.org | fakehttp://my-third-concept-url.org |