Synchronizing clinical data with a bucket

In this section, we show how to synchronize research data with a bucket in your organizational dataset. The goal of this step is to gather data from different sources and sort it into a single, consistently structured dataset (which we will then validate in the next section).

Not available for Community Projects

View our pricing for more information.

The study described in the health reference design consists of 10 subjects performing 1.5 to 2 hours of activities in a research lab. Each participant is referred to by a study ID (e.g. AMS_001). For each participant we have four CSV files:

accelerometer.csv - data from the wearable end device.
ppg.csv - data from the wearable end device.
polar_h10.csv - reference data from a commercial reference device (Polar H10).
labels.csv - labels of the activity, as recorded by the research lab.

We've mimicked a proper research study, and have split the data up into two locations.

accelerometer.csv / ppg.csv live in the company data lake in S3. The data lake uses an internal structure with non-human-readable IDs for each participant (e.g. 2E93ZX for anonymized data):
```
7HAIGO
|_ accelerometer.csv
|_ ppg.csv
Z0ZPJW
|_ accelerometer.csv
|_ ppg.csv
```
polar_h10.csv / labels.csv are uploaded by the research partner to an upload portal. The files are prefixed with the study ID (e.g. AMS_001_polar_h10.csv and AMS_001_labels.csv).

To create the mapping between the study ID and the internal data lake ID we use a study master sheet. It contains information about all participants, ID mapping, and metadata. E.g.:

Subject	    Internal ID	    Study date	    Age	    BMI
AMS_001	    7HAIGO      	2022-03-10	    24	    18
AMS_002	    Z0ZPJW      	2022-01-27	    35	    31

Notes: This master sheet was made in Google Sheets, but any format works. All data (data lake, portal, output) is hosted in an Edge Impulse S3 bucket, but it can be stored anywhere (see below).
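The ID mapping described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the master sheet has been exported as CSV with the column headers shown in the table; the function names `load_master_sheet` and `internal_to_study_id` are hypothetical and not part of any Edge Impulse tooling.

```python
import csv
import io

# Assumed CSV export of the study master sheet. The two rows are the
# example rows from the table above.
MASTER_SHEET_CSV = """Subject,Internal ID,Study date,Age,BMI
AMS_001,7HAIGO,2022-03-10,24,18
AMS_002,Z0ZPJW,2022-01-27,35,31
"""


def load_master_sheet(text):
    """Return {study_id: remaining columns} for every participant."""
    participants = {}
    for row in csv.DictReader(io.StringIO(text)):
        study_id = row.pop("Subject").strip()
        if study_id in participants:
            raise ValueError(f"duplicate study ID: {study_id}")
        participants[study_id] = {key: value.strip() for key, value in row.items()}
    return participants


def internal_to_study_id(participants):
    """Invert the sheet: data lake ID -> study ID."""
    return {info["Internal ID"]: study_id for study_id, info in participants.items()}


participants = load_master_sheet(MASTER_SHEET_CSV)
lake_to_study = internal_to_study_id(participants)
print(lake_to_study["7HAIGO"])  # AMS_001
```

The remaining columns (study date, age, BMI) stay attached to each study ID, so the same lookup can later supply the metadata for each participant.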

Configuring a storage bucket for your dataset

Data is stored in storage buckets, which can be hosted either by Edge Impulse or in your own infrastructure. If you choose to host the data yourself, your infrastructure should be available through the S3 API, and you are responsible for setting up proper backups. To configure a new storage bucket, head to your organization, choose Data > Buckets, click Add new bucket, and fill in your access credentials. Our solution engineers can also help you set up the buckets.

About datasets

With the storage bucket in place you can create your first dataset. Datasets in Edge Impulse have three layers:

Dataset: a larger set of data items, grouped together.
Data item: an item with metadata and files attached.
Data file: the actual files.
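As a mental model only, the three layers can be pictured as nested records. The sketch below uses hypothetical Python dataclasses to show the nesting; it is not the Edge Impulse API or SDK, and the field names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    name: str  # e.g. "accelerometer.csv"


@dataclass
class DataItem:
    name: str                                      # e.g. the study ID
    metadata: dict = field(default_factory=dict)   # key/value pairs
    files: list = field(default_factory=list)      # list of DataFile


@dataclass
class Dataset:
    name: str
    items: list = field(default_factory=list)      # list of DataItem


item = DataItem(
    name="AMS_001",
    metadata={"Age": "24", "BMI": "18"},
    files=[DataFile("accelerometer.csv"), DataFile("ppg.csv")],
)
dataset = Dataset(name="AMS Activity Study 2022", items=[item])
print(len(dataset.items), [f.name for f in dataset.items[0].files])
```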

No required format for data files

There is no required format for data files. You can upload data in any format, whether it's CSV, Parquet, or a proprietary data format.

Adding research data to your organization

There are three ways of uploading data into your organization. You can either:

Upload data directly to the storage bucket (recommended method). In this case use Add data... > Add dataset from bucket and the data will be discovered automatically.
Upload data through the Edge Impulse API.
Upload the files through the Upload Portals.

Sorter and combiner

Sorter

The sorter is the first step of the research pipeline. Its job is to fetch the data from all locations (here: internal data lake, portal, metadata from study master sheet) and create a research dataset in Edge Impulse. It does this by:

Creating a new structure in S3 like this:

AMS_001
|_ AMS_001_labels.csv
|_ AMS_001_polar_h10.csv
|_ accelerometer.csv
|_ ppg.csv
AMS_002
|_ AMS_002_labels.csv
|_ AMS_002_polar_h10.csv
|_ accelerometer.csv
|_ ppg.csv

Syncing the S3 folder with a research dataset in your Edge Impulse organization (like AMS Activity Study 2022).
Updating the dataset's metadata with the values from the master sheet (Age, BMI, etc.).
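The restructuring step can be sketched as a pure function that decides where each source file should end up, before any copying happens. This is an illustrative sketch, not the actual sorter: the function name `plan_sorted_layout` is hypothetical, and it assumes that data lake keys look like `<internal ID>/<file>` and that portal files start with `<study ID>_`, as in the examples above.

```python
import posixpath

# Data lake ID -> study ID, as read from the study master sheet.
LAKE_TO_STUDY = {"7HAIGO": "AMS_001", "Z0ZPJW": "AMS_002"}


def plan_sorted_layout(lake_keys, portal_keys, lake_to_study):
    """Return {source_key: destination_key} for the sorted dataset."""
    study_ids = set(lake_to_study.values())
    plan = {}
    for key in lake_keys:
        internal_id, filename = key.split("/", 1)
        # A KeyError here flags a participant missing from the master sheet.
        study_id = lake_to_study[internal_id]
        plan[key] = posixpath.join(study_id, filename)
    for key in portal_keys:
        filename = posixpath.basename(key)
        study_id = next((s for s in study_ids if filename.startswith(s + "_")), None)
        if study_id is None:
            raise ValueError(f"no known study ID prefix in portal file: {key}")
        plan[key] = posixpath.join(study_id, filename)
    return plan


plan = plan_sorted_layout(
    lake_keys=["7HAIGO/accelerometer.csv", "7HAIGO/ppg.csv"],
    portal_keys=["AMS_001_labels.csv", "AMS_001_polar_h10.csv"],
    lake_to_study=LAKE_TO_STUDY,
)
for source, destination in plan.items():
    print(f"{source} -> {destination}")
```

Separating the plan from the copy makes the mapping easy to inspect and test before anything is written to the bucket; the copy itself would be done with whichever S3 client you already use.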

Combiner

With the data sorted, we then:

Verify that the data is correct (see validate your research data).
Combine the data into a single Parquet file. This is essentially the contract we have for our dataset. By settling on a standard format (strongly typed, same column names everywhere), the data is ready to be used for ML, new algorithm development, etc. Because we also add metadata for each file here, we quickly build up a valuable R&D datastore.
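The idea of a contract (strong types, same column names everywhere) can be illustrated without any Parquet library. In the sketch below, the schema, column names, sample values, and the function `apply_contract` are all hypothetical, since the text does not give the real columns of the study's CSV files. Writing the typed rows to Parquet would be a separate step using a Parquet library and is not shown.

```python
# Hypothetical contract: column name -> Python type.
SCHEMA = {"timestamp_ms": int, "heart_rate_bpm": float, "label": str}


def apply_contract(rows, schema, metadata):
    """Cast each row to the schema and attach per-participant metadata."""
    typed_rows = []
    for index, row in enumerate(rows):
        missing = set(schema) - set(row)
        if missing:
            raise ValueError(f"row {index} is missing columns: {sorted(missing)}")
        # Only columns named in the schema survive, in schema order.
        typed = {column: cast(row[column]) for column, cast in schema.items()}
        typed.update(metadata)
        typed_rows.append(typed)
    return typed_rows


# Made-up sample rows, as they would come out of csv.DictReader (all strings).
raw_rows = [
    {"timestamp_ms": "0", "heart_rate_bpm": "61.5", "label": "sitting"},
    {"timestamp_ms": "1000", "heart_rate_bpm": "62.0", "label": "sitting"},
]
combined = apply_contract(raw_rows, SCHEMA, {"subject": "AMS_001"})
print(combined[0])
```

Rejecting rows with missing columns up front keeps every record in the combined output on the same schema, which is what makes the file usable as a contract.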

All these steps can be run through different transformation blocks and executed one after the other using data pipelines.
