The novel coronavirus disease, COVID-19, presents challenges the world hasn't seen in decades. Humanity has fought global pandemics before, and it is never easy. But this time we have an additional weapon on our side: data.
Data helps researchers understand the spread of the disease, how it is transmitted, and the rate at which transmissions occur from the initial infection. Data is invaluable in defeating this virus. But researchers face unique challenges in working with health care data. New files are added to public health databases each day. Aggregating them is one big challenge, and once files are joined together, records must be matched up so that time series stay accurate (e.g. incidents by date, or by location). In many cases, the data must be cleaned up as well.
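As a toy illustration of that matching step, two daily extracts that describe the same places and dates (the file contents and column names below are hypothetical) can be indexed on a shared (date, location) key and then joined:

```python
import csv
import io

# Two hypothetical daily extracts that use different header names
# for the same underlying fields.
FILE_A = """date,region,cases
2020-03-01,Lombardia,1000
2020-03-02,Lombardia,1500
"""
FILE_B = """Date,Area,Deaths
2020-03-01,Lombardia,50
2020-03-02,Lombardia,80
"""

def index_by_key(text, date_col, loc_col):
    """Index rows by a (date, location) key so series line up across files."""
    rows = csv.DictReader(io.StringIO(text))
    return {(r[date_col], r[loc_col]): r for r in rows}

def merge_series(a, b):
    """Join two keyed datasets on their shared (date, location) keys."""
    return {k: {**a.get(k, {}), **b.get(k, {})} for k in a.keys() | b.keys()}

a = index_by_key(FILE_A, "date", "region")
b = index_by_key(FILE_B, "Date", "Area")
merged = merge_series(a, b)
```

Each merged record now carries both the case and death counts for one date and place, which is the shape a time-series model or chart expects.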
Health care data professionals and other researchers need data of the highest quality and accuracy. In addition, the cleaner and more accurate the datasets are, the faster they are to ingest and work with.
That’s where Talend steps in.
What’s currently in the COVID-19 datasets?
In collaboration with developers from the Singer open source community, a joint team from Talend and Bytecode has created a tool to ETL COVID-19 datasets. We standardize the data, augment it with metadata, then route the results to a data warehouse or data lake: Amazon Redshift, Amazon S3, Snowflake, Microsoft Azure Synapse Analytics, Delta Lake for Databricks, or Google BigQuery. Data engineers and scientists can run the tool on their own infrastructure or use Stitch for free.
The COVID-19 integration covers several datasets:
- Johns Hopkins CSSE Data
- EU Data
- Italy Data
- NY Times US Data
- Neher Lab Scenarios Data
- COVID-19 Tracking Project
The data stored in these repositories lacks a common format. For instance, the EU Data comprises data from different countries, and the header names for the same type of data differ between them. Even slight differences like these force data professionals to take extra time and steps to cleanse and standardize the data. Processing these datasets through our ETL gives users guaranteed consistency, so they can focus on their models or visualizations and make faster, more confident decisions.
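To make the standardization step concrete, here is a minimal sketch of the idea: a rename map that folds differing header variants into one canonical schema. The specific variant names below are illustrative, not the actual columns of these datasets.

```python
# Hypothetical header variants seen across source files; the real
# datasets' column names differ from these illustrative ones.
HEADER_MAP = {
    "Date": "date",
    "dateRep": "date",
    "Country": "country",
    "countriesAndTerritories": "country",
    "Cases": "cases",
}

def standardize(row):
    """Rename known header variants to one canonical schema,
    passing unknown columns through unchanged."""
    return {HEADER_MAP.get(k, k): v for k, v in row.items()}

row_a = {"dateRep": "2020-03-01", "countriesAndTerritories": "Italy", "cases": "240"}
row_b = {"Date": "2020-03-01", "Country": "Italy", "Cases": "240"}
```

After standardization, `row_a` and `row_b` are identical records, so downstream joins and aggregations no longer have to special-case each source.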
How the COVID-19 dataset works
The tap uses the GitHub v3 API to query and retrieve files stored in multiple GitHub repositories. Users must supply a GitHub personal access token, which raises the rate limit on the tap's API calls. They can then select one or all of the supported datasets and their associated fields, choose one of the Stitch destinations, and set the frequency of the loads. Because the data is typically updated more than once a day, we suggest a frequency of every 6 to 12 hours, but you can choose more frequent replication.
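Like other Singer taps, the tap reads a small JSON config file. A minimal sketch might look like the following; the key names here are assumptions based on common Singer conventions, so check the tap's repository for the exact schema:

```json
{
  "api_token": "<your GitHub personal access token>",
  "start_date": "2020-01-22T00:00:00Z",
  "user_agent": "tap-covid-19 <your-email@example.com>"
}
```

With a config in place, the usual Singer workflow applies: run the tap in discovery mode (`tap-covid-19 --config config.json --discover > catalog.json`) to list the available streams and fields, select the ones you want in the catalog, then run `tap-covid-19 --config config.json --catalog catalog.json` to start extracting.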
How to access and explore the COVID-19 integrated dataset
These datasets should be beneficial to anyone doing health research. Interested researchers can run the data import for free on the Stitch platform. Here are all the options for accessing the data or joining the effort to further build out this dataset.
GitHub repo: https://github.com/singer-io/tap-covid-19
Singer tap: https://www.singer.io/tap/covid-19-public-data/
Stitch integration: https://www.stitchdata.com/integrations/covid-19/
Finally, the entire dataset is available in a read-only Redshift data warehouse, courtesy of AWS, found at covid-19-public-data.c4ft0fualwvc.us-east-2.redshift.amazonaws.com – access with the following credentials:
database: covid_19 | schema: covid_19 | user: publicuser | password: Covid-19
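For anyone who prefers code to a SQL client, here is a minimal sketch of connecting to that warehouse with the `psycopg2` driver. The connection details come from the credentials above; port 5439 is the Redshift default and an assumption here, and rather than guess table names, the query lists whatever tables exist in the `covid_19` schema.

```python
# psycopg2 is a third-party driver: pip install psycopg2-binary
# Connection details are the public credentials given above;
# port 5439 is the Redshift default (an assumption here).
REDSHIFT = {
    "host": "covid-19-public-data.c4ft0fualwvc.us-east-2.redshift.amazonaws.com",
    "port": 5439,
    "dbname": "covid_19",
    "user": "publicuser",
    "password": "Covid-19",
}

def list_tables_sql(schema):
    """SQL that lists the tables in a schema, so we don't guess names."""
    return (
        "SELECT table_name FROM information_schema.tables "
        f"WHERE table_schema = '{schema}' ORDER BY table_name"
    )

def fetch_table_names():
    """Connect to the public warehouse and return its table names."""
    import psycopg2
    with psycopg2.connect(**REDSHIFT) as conn, conn.cursor() as cur:
        cur.execute(list_tables_sql("covid_19"))
        return [name for (name,) in cur.fetchall()]
```

Calling `fetch_table_names()` shows what is available; from there you can `SELECT` from any table into your analysis tool of choice.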