Edge API User Guide¶
General context¶
The goal of the Edge package is to manage experimental data. It provides tools to create structured models of the type of data that will be stored, ingest that data into a central data catalog, and search and retrieve it. This document shows how to interact with the package using the API (separate documents show how to use the GUI application).
About the data model¶
The Edge data model consists of 'data model' (EntityKind) instances, each of which describes a real-world object (e.g., sample, instrument, image) and contains a hierarchically organized tree structure. This tree consists of data group (FactNode) branch nodes and data field (FactKind) leaf nodes. The data fields each represent a particular type of information associated with the parent data model (e.g., an individual recipe or performance characteristic). Each data field contains a 'configurable' (ValueConfiguration) instance, which describes the units and data type (e.g., float, int, string) of the value. There may optionally be additional configurables describing configurable parameters of the value (e.g., measurement temperature). While the EntityKind and FactKind classes represent the types of known data, the Entity and Fact classes represent the corresponding realized instances of them.
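As a purely illustrative sketch, the relationships between these concepts can be modelled with plain dataclasses. These are hypothetical stand-ins, not the real Edge classes, whose APIs differ:

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class ValueConfiguration:      # describes the units and data type of a value
    name: str
    units: str
    value_type: str

@dataclass
class FactKind:                # data field: a leaf node
    name: str
    value: ValueConfiguration
    configurables: List[ValueConfiguration] = field(default_factory=list)

@dataclass
class FactNode:                # data group: a branch node
    name: str
    children: List[Union["FactNode", FactKind]] = field(default_factory=list)

@dataclass
class EntityKind:              # data model: the root of the tree
    name: str
    root: FactNode

# A fragment of the example model used throughout this guide
model = EntityKind(
    'Edge_Example',
    FactNode('Evaluation', [
        FactKind(
            'Tensile Strength',
            ValueConfiguration('value', 'MPa', 'float'),
            [ValueConfiguration('Temperature', 'ºC', 'float')],
        ),
    ]),
)
```

The key point of the structure is that groups may nest, fields are always leaves, and every field carries at least one ValueConfiguration for its own value.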
Connecting to a Data Catalog¶
Interaction with the Data Catalog is performed via an EdgeService instance. This service can be initialized in two different ways.
With a configuration file¶
The Edge application stores its catalog connection details in a YAML file at ~/.enthought_edge/data_catalog_config.yml. This file can also be used for connecting to the catalog outside of the application:
from edge import api
service = api.connect(
    config_path='/path/to/config/file.yml',
)
Without a config file¶
A host URL and catalog name can be used to initialize the service:
from edge import api
service = api.connect(
    catalog_host="catalog.enthought.com",
    catalog_name="edge_demo1",
    catalog_port=5000  # Optional, defaults to 5000
)
If no authentication arguments are provided, a hatcher token will be searched for, first in the environment variable $HATCHER_TOKEN and then in the ~/.edm.yaml file.
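The lookup order can be sketched as follows. Note that find_hatcher_token and the api_token: key are hypothetical illustrations of the described behaviour, not the actual Edge implementation:

```python
import os
from pathlib import Path

def find_hatcher_token(environ=None, edm_yaml='~/.edm.yaml'):
    """Hypothetical sketch of the token lookup order described above."""
    environ = os.environ if environ is None else environ
    # 1. The $HATCHER_TOKEN environment variable takes precedence.
    token = environ.get('HATCHER_TOKEN')
    if token is not None:
        return token
    # 2. Otherwise fall back to ~/.edm.yaml. A real implementation would
    #    parse the YAML; scanning for a token line is only an illustration.
    path = Path(edm_yaml).expanduser()
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip().startswith('api_token:'):
                return line.split(':', 1)[1].strip()
    return None
```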
Otherwise, authentication can be provided by passing a hatcher token as an argument:
from edge import api
service = api.connect(
    catalog_host="catalog.enthought.com",
    catalog_name="edge_demo1",
    api_token='YOUR_HATCHER_TOKEN',
)
or a username and password (this example uses getpass to prompt the user for their password without saving it in the notebook):
from edge import api
from getpass import getpass
service = api.connect(
    catalog_host="catalog.enthought.com",
    catalog_name="edge_demo1",
    username='YOUR_USERNAME',
    password=getpass()
)
Retrieving saved data models from the catalog¶
Data model retrieval¶
data_model = service.get_entity_kind(
    'Edge_Example'  # Name of the data model
)
Looking up an object in the data model from its path¶
It is also possible to access the various components (Data Group, Data Field and Configurables) in a given data model, provided their paths are known.
The path is just a string representing the location of the object within the data model. It is the names of the ancestors of that object, each separated by a '/', plus the name of the object itself. For example, given a data model with the following structure:
- Edge_Example
- Evaluation
- Mechanical Properties
* Tensile Strength (MPa)
+ Temperature (ºC)
- Physical Properties
* IR Spectrum
the path representing the data group 'Evaluation' would be 'Edge_Example/Evaluation', the path representing the data field 'Tensile Strength' would be 'Edge_Example/Evaluation/Mechanical Properties/Tensile Strength', and the path representing the configurable 'Temperature' would be 'Edge_Example/Evaluation/Mechanical Properties/Tensile Strength/Temperature'.
# Get a data group
api.lookup_data_group(
    data_model,
    'Edge_Example/Evaluation')
# Get a data field
api.lookup_data_field(
    data_model,
    'Edge_Example/Evaluation/Mechanical Properties/Tensile Strength')
# Get a configurable
api.lookup_configurable(
    data_model,
    'Edge_Example/Evaluation/Mechanical Properties/Tensile Strength/Temperature')
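The path convention itself is plain string handling; the ancestor names and the object's own name can be recovered with a split, no Edge imports required:

```python
path = 'Edge_Example/Evaluation/Mechanical Properties/Tensile Strength'

# Everything before the last '/' names the ancestors; the final
# component names the object itself.
*ancestor_names, object_name = path.split('/')
```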
Visualising data models¶
The structure of a data model can be visualised by printing it. Data groups are prepended with a '-', data fields with a '*', and configurables with a '+'; children are indented underneath their parents. For the model constructed above, this is the result:
>>> print(data_model)
- Edge_Example
- Condition
* Extruder
* Screw Type
* Screw Speed (rpm)
- Evaluation
- Mechanical Properties
* Tensile Strength (MPa)
+ Temperature (ºC)
- Physical Properties
* IR Spectrum
- Composition
* Base (%)
* Flame Retardant (%)
+ Purity (%)
* Timestamp
Data search and export¶
To retrieve a subset of data already in the catalog as a Pandas DataFrame, use the search function. This takes as arguments:
- an EdgeService instance (see Connecting to a Data Catalog),
- a list of data models to search (see Data model retrieval),
- a list of criteria to filter the facts by (each a tuple containing the path to the data field, the name of the configurable to filter on, a condition, and a value to compare the configurable to), and
- a data table to set which columns you want in the returned DataFrame (see Data Tables)
# Create the search query
# List of Search Terms
all_search_terms = [
    ('Edge_Example/Composition/Flame Retardant', 'value', 'Exists', None),
    ('Edge_Example/Condition/Screw Speed', 'value', '>', 1),
]
search_terms = []
for path, config_name, condition, value in all_search_terms:
    data_field = api.lookup_data_field(data_model, path)
    search_terms.append(((data_field, config_name), condition, value))
# Retrieve a data frame containing facts matching the search criteria
df = api.search(
    service,
    data_models=[data_model],
    criteria=search_terms,
    data_table=data_table  # Use a data table to only include the desired columns
)
# Export to .csv and .xlsx
df.to_csv('search_results.csv')
df.to_excel('search_results.xlsx')
Tidy data¶
The search function returns pandas data frames following the 'tidy data' paradigm (see here and here). Briefly, 'tidy' datasets avoid containing any values associated to a measurement in headers, so columns for 'Pressure @ 20ºC' and 'Pressure @ 30ºC' would be represented instead as:
Entity Id. | Pressure (Pa) | Temperature (ºC)
---|---|---
001 | 10 | 20
001 | 15 | 30
002 | 20 | 20
002 | 25 | 30
003 | 30 | 40
This means that data from some entities can be split across multiple rows. It also allows for entities which have “redundant” Facts; those with the same configuration, but different values.
If you must, you can "untidy" the DataFrame:
df.pivot(
    index='Pressure (Pa)',
    columns='Entity Id.',
    values='Temperature (ºC)')
results in
Pressure (Pa) \ Entity Id. | 001 | 002 | 003
---|---|---|---
10 | 20 | NaN | NaN
15 | 30 | NaN | NaN
20 | NaN | 20 | NaN
25 | NaN | 30 | NaN
30 | NaN | NaN | 40
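The tidy table and its pivoted form above can be reproduced with plain pandas; the column names are taken from the example tables:

```python
import pandas as pd

# Tidy form: one row per (entity, measurement) combination.
tidy = pd.DataFrame({
    'Entity Id.': ['001', '001', '002', '002', '003'],
    'Pressure (Pa)': [10, 15, 20, 25, 30],
    'Temperature (ºC)': [20, 30, 20, 30, 40],
})

# "Untidy" the frame: one row per pressure, one column per entity.
# Missing combinations become NaN.
wide = tidy.pivot(
    index='Pressure (Pa)',
    columns='Entity Id.',
    values='Temperature (ºC)',
)
```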
Data Tables¶
When searching for data, a DataTable object can be used to apply further levels of filtering and customization to the output.
Creating a data table¶
A DataTable can first be created for the EntityKind(s) of interest:
from edge.model import DataTable
screw_type_fact_kind = api.lookup_data_field(
    data_model, 'Edge_Example/Condition/Screw Type')
data_table = api.create_data_table(
    'example_table',
    data_models=[data_model],
    data_fields=[screw_type_fact_kind]
)
Adding data fields to a data table¶
Next, FactKinds (and optionally a column name) can be added to the DataTable after it is created. If no column name is given, the name of the FactKind is used.
screw_speed_fact_kind = api.lookup_data_field(
    data_model_2, 'Example 2/Condition/Screw Speed')
data_table.add_column(screw_speed_fact_kind, col_name="Screw Speed")
Creating links between Data Models in a Data Table¶
Data Tables also offer the possibility to link/join columns corresponding to fields or configurables in different data models. Two different models may both contain fields referring to the same real-world fact. Joining them allows a more efficient comparison between datasets, or an easier aggregation of data. This is discussed in more detail in Data Links.
Suppose you have two data models, one describing a chemical formulation and one describing an experiment using this formulation. You can produce a set of data for each of these data models, but data links allow you to merge them into a single dataframe. If you have a DataTable data_table which already contains the formulation data model, you can add the experiment data model experiment_data_model to the table using:
data_table.add_entity_kind(experiment_data_model)
Then, you can indicate how to link these two data models by doing:
api.create_data_link(
    data_table,
    'Formulation | Name',
    'Experiment | Formulation Name',
    'outer')
The strings 'Formulation | Name' and 'Experiment | Formulation Name' refer to the column names in the Pandas DataFrame, which can be found using df.columns.
When passing the data table containing this link to the search function, the resulting DataFrame will join together the rows which have a matching name/formulation name. The type of link can be "inner" (only data that shares the same value for the formulation name will be fetched) or "outer" (all data satisfying the search query will be fetched). The default is "inner".
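The "inner"/"outer" semantics match those of a pandas merge. Here is a standalone sketch with made-up values; the 'Model | Column' naming follows the example above:

```python
import pandas as pd

formulations = pd.DataFrame({
    'Formulation | Name': ['F1', 'F2', 'F3'],
    'Formulation | Base (%)': [80, 70, 60],
})
experiments = pd.DataFrame({
    'Experiment | Formulation Name': ['F1', 'F2', 'F4'],
    'Experiment | Tensile Strength (MPa)': [1.2, 3.4, 5.6],
})

# "inner": only rows whose names match in both frames survive.
inner = formulations.merge(
    experiments, how='inner',
    left_on='Formulation | Name',
    right_on='Experiment | Formulation Name',
)

# "outer": all rows from both frames, with NaN where there is no match.
outer = formulations.merge(
    experiments, how='outer',
    left_on='Formulation | Name',
    right_on='Experiment | Formulation Name',
)
```

Here the inner join keeps only F1 and F2, while the outer join also keeps the unmatched F3 and F4 rows.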
Saving/Loading data tables¶
Data Tables can also be saved and loaded via the Data Catalog.
# Saving a DataTable to the catalog. Note: Saving a DataTable requires it
# to have a unique name
service.add_data_table(data_table)
# Loading a DataTable
loaded_data_table = service.get_data_table(name='example_table')
Data import¶
Creating Entities and Facts¶
Once the data model has been constructed, the API can be used to ingest data, i.e. to create entities and facts.
Importing from an Excel File¶
Before using the Edge Scripting Tool API to import data from files, an ImportRecord instance needs to be created in the Data Importer of Edge. This record relates a data model to Excel files by describing how data corresponding to data fields in the model can be extracted from the spreadsheets.
Providing an ImportRecord and an Excel file, we can extract data in the form of entities and their facts using the parse_file_to_entity_group function as follows:
template = api.get_import_record(
    service, data_model, 'Formulation Standard')
parsed_entity_group = api.parse_file_to_entity_group(
    excel_file, data_model, template)
This creates an EntityGroup, a bundle of entities along with their associated facts, containing the data from the Excel file now mapped to the data model.
Creating Data Manually¶
It is also possible to create data without an Excel file, by creating an Entity and the Fact objects associated with it. The following code block illustrates how to do so:
# Create a new entity of the kind specified in the above data model
entity = data_model.create_entity()
# Retrieve relevant FactKind instances to make some Facts
screw_type_fact_kind = api.lookup_data_field(
    data_model, 'Edge_Example/Condition/Screw Type')
screw_speed_kind = api.lookup_data_field(
    data_model, 'Edge_Example/Condition/Screw Speed')
# Create some facts about the new entity
facts = [
    screw_type_fact_kind.create_fact(entity, value='Type A'),
    screw_speed_kind.create_fact(entity, value=10.0),
]
Saving Entities and Facts to the database¶
# Add the entity
service.add_entities([entity])
# Add the facts
service.add_facts(facts)
# Alternatively, bundle the entity and facts into an EntityGroup, and upload
# them together
entity_group = api.create_entity_group(data_model, [entity], facts)
service.add_entity_group(entity_group)
Note that Entity or EntityGroup instances can only be added if their parent EntityKind already exists in the database (see Saving the data model). Likewise, Fact instances can only be added once their parent Entity has been added.
Saving Facts which already exist in the database¶
If you are modifying an existing Fact, rather than saving a Fact to the database for the first time, you must add it back to the database using update_facts instead of add_facts:
service.update_facts(facts)
Raw data import¶
A command-line utility is available for importing raw data (e.g. images and
other files) from structured directory hierarchies.
The basic usage requires only a path for the directory containing the files to import, and a fileset_depth integer indicating the depth, relative to the import path, at which files are considered part of the same logical unit for parsing.
Typical basic usage may look something like:
python -m edge.io.raw_data_import --path /path/to/data --fileset-depth 3
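As an illustration of what fileset_depth means, files can be grouped by the leading components of their paths relative to the import root. This is a sketch of the grouping rule only, not the utility's actual implementation:

```python
from collections import defaultdict
from pathlib import Path

def group_filesets(root, fileset_depth):
    """Hypothetical sketch: treat files whose relative paths share the
    same leading ``fileset_depth`` components as one logical unit."""
    root = Path(root)
    filesets = defaultdict(list)
    for path in sorted(root.rglob('*')):
        if path.is_file():
            rel = path.relative_to(root)
            # The grouping key is the first ``fileset_depth`` components.
            key = '/'.join(rel.parts[:fileset_depth])
            filesets[key].append(str(rel))
    return dict(filesets)
```

For example, with fileset_depth=2, sampleA/run1/image.tif and sampleA/run1/meta.json would land in the same fileset, while sampleA/run2/image.tif would start a new one.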
See the API documentation for more detail and advanced usage of this utility.
See also the Endex documentation on Bulk Data Import, which is the basis for this utility.
Data modeling¶
Creating a data model¶
If the data model creation is likely to be a one-off task, it may be easiest to create the model via the UI and save it to the catalog, and then retrieve it when required using the service (see Data model retrieval)
However, data models can be created using the scripting API.
- create_configurable creates a configurable
- create_data_field creates a data field
- create_data_group creates a data group
- create_data_model creates a data model
Creating data fields¶
Below is an example of creating some data fields for a data model, including an extra configurable for one of those data fields:
from edge import api
timestamp_kind = api.create_data_field(
    'Timestamp', value_type='datetime')
extruder_kind = api.create_data_field(
    'Extruder', value_type='str')
screw_type_kind = api.create_data_field(
    'Screw Type', value_type='str')
screw_speed_kind = api.create_data_field(
    'Screw Speed', units='rpm', value_type='float')
ir_spectrum_kind = api.create_data_field(
    'IR Spectrum', value_type='fileset:ir')
sem_image_kind = api.create_data_field(
    'SEM Image', value_type='fileset:sem')
# Create configurable
temperature_configurable = api.create_configurable(
    'Temperature', units='ºC', value_type='float')
tensile_strength_kind = api.create_data_field(
    'Tensile Strength',
    units='MPa',
    value_type='float',
    configurables=[temperature_configurable]
)
Creating data groups¶
After all the data fields of the model are created, they can be grouped into specific data groups. Below is an example:
condition_group = api.create_data_group(
    'Condition',
    data_fields=[extruder_kind, screw_type_kind, screw_speed_kind]
)
mechanical_properties_group = api.create_data_group(
    'Mechanical Properties',
    data_fields=[tensile_strength_kind]
)
physical_properties_group = api.create_data_group(
    'Physical Properties',
    data_fields=[ir_spectrum_kind]
)
# Groups of groups work as well
evaluation_group = api.create_data_group(
    'Evaluation',
    subgroups=[mechanical_properties_group, physical_properties_group]
)
Assembling the data model¶
Finally, the data model can be made by using the objects created above:
data_model = api.create_data_model(
    'Edge_Example',
    data_fields=[timestamp_kind],
    groups=[condition_group, evaluation_group]
)
Editing a data model¶
The functions above allow creation of a data model in a "bottom up" fashion: configurables first, then data fields, then data groups, and finally the data model. There are also functions that allow a "top down" construction, by editing a data model which has already been constructed.
Adding children to a data model/group/field after creation¶
- add_configurable adds a configurable to an already existing data field in a model
- add_data_field adds a data field to an already existing data group in a model
- add_data_group adds a data group to an already existing data group in a model
composition_group = api.create_data_group('Composition')
api.add_data_group(
    data_model,        # Data model to add to
    'Edge_Example',    # Path to data group within the data model (here, the root node)
    composition_group  # Data group to add
)
base_kind = api.create_data_field(
    'Base', units='%', value_type='float')
flame_retardant_kind = api.create_data_field(
    'Flame Retardant', units='%', value_type='float')
api.add_data_field(
    data_model,                  # Data model to add to
    'Edge_Example/Composition',  # Path to data group within the data model
    base_kind                    # Data field to add
)
api.add_data_field(
    data_model,                  # Data model to add to
    'Edge_Example/Composition',  # Path to data group within the data model
    flame_retardant_kind         # Data field to add
)
purity_configurable = api.create_configurable(
    'Purity', units='%', value_type='float')
api.add_configurable(
    data_model,                                  # Data model to add to
    'Edge_Example/Composition/Flame Retardant',  # Path to data field within the data model
    purity_configurable                          # Configurable to add
)
Editing the description of a data model¶
- edit_data_model_description edits the data model description
# Edit description
api.edit_data_model_description(
    data_model, 'Insulation Manufacturing Data')
Deleting parts of the data model¶
- delete_data_field removes a data field from a model
- delete_data_group removes a data group from a model
# Delete data field
api.delete_data_field(data_model, ('Edge_Example', 'Evaluation', 'Mechanical Properties', 'Tensile Strength'))
# String as path also works
api.delete_data_field(data_model, 'Edge_Example/Condition/Screw Type')
# Delete data group
api.delete_data_group(data_model, 'Edge_Example/Condition')
Editing a data group¶
- edit_data_group edits the name and description of a data group

Note that there is no function for editing a data field; to change one, delete it and add a replacement.
api.edit_data_group(
    data_model, 'Edge_Example/Composition', name='Beethoven',
    description='Composer')
Saving the data model¶
Once the model is created, it needs to be saved to the Data Catalog using the EdgeService instance initialized above.
# Add the data model
service.add_entity_kind(data_model)
Saving a model which already exists in the database¶
If you are modifying an existing model, rather than saving a model to the database for the first time, you must add it back to the database using update_entity_kind instead of add_entity_kind:
service.update_entity_kind(data_model)