COVID-19 Open Research Dataset

Note

This page is part of the Dataset Import section, which includes examples of loading popular datasets into YDB. Before starting, please review the general information on requirements and the import process.

An open dataset of COVID-19 research.

Source: Kaggle - COVID-19 Open Research Dataset Challenge

Size: 1.65 GB (metadata.csv file)

Loading Example

  1. Download and unzip the metadata.csv file from Kaggle.

  2. The dataset includes completely identical rows. Since YDB requires unique primary key values, add a new column named row_id to the file, where the key value will be equal to the row number in the original file. This prevents the removal of duplicate data. This operation can be carried out using the awk command:

    awk 'NR==1 {print "row_id," \$0; next} {print NR-1 "," \$0}' metadata.csv > temp.csv && mv temp.csv metadata.csv
    
  3. Create a table in YDB using one of the following methods:

    For more information on Embedded UI.

    CREATE TABLE `covid_research` (
        `row_id` Uint64 NOT NULL,
        `cord_uid` Text NOT NULL,
        `sha` Text NOT NULL,
        `source_x` Text NOT NULL,
        `title` Text NOT NULL,
        `doi` Text NOT NULL,
        `pmcid` Text NOT NULL,
        `pubmed_id` Text NOT NULL,
        `license` Text NOT NULL,
        `abstract` Text NOT NULL,
        `publish_time` Text NOT NULL,
        `authors` Text NOT NULL,
        `journal` Text NOT NULL,
        `mag_id` Text,
        `who_covidence_id` Text,
        `arxiv_id` Text,
        `pdf_json_files` Text NOT NULL,
        `pmc_json_files` Text NOT NULL,
        `url` Text NOT NULL,
        `s2_id` Uint64,
        PRIMARY KEY (`row_id`)
    )
    WITH (
        STORE = COLUMN
    );
    
    ydb sql -s \
    'CREATE TABLE `covid_research` (
        `row_id` Uint64 NOT NULL,
        `cord_uid` Text NOT NULL,
        `sha` Text NOT NULL,
        `source_x` Text NOT NULL,
        `title` Text NOT NULL,
        `doi` Text NOT NULL,
        `pmcid` Text NOT NULL,
        `pubmed_id` Text NOT NULL,
        `license` Text NOT NULL,
        `abstract` Text NOT NULL,
        `publish_time` Text NOT NULL,
        `authors` Text NOT NULL,
        `journal` Text NOT NULL,
        `mag_id` Text,
        `who_covidence_id` Text,
        `arxiv_id` Text,
        `pdf_json_files` Text NOT NULL,
        `pmc_json_files` Text NOT NULL,
        `url` Text NOT NULL,
        `s2_id` Uint64,
        PRIMARY KEY (`row_id`)
    )
    WITH (
        STORE = COLUMN
    );'
    
  4. Execute the import command:

    ydb import file csv --header --null-value "" --path covid_research metadata.csv
    

Analytical Query Example

Run a query to determine the journals with the highest number of publications:

SELECT
    journal,
    COUNT(*) AS publication_count
FROM covid_research
WHERE journal IS NOT NULL AND journal != ''
GROUP BY journal
ORDER BY publication_count DESC
LIMIT 10;
ydb sql -s \
'SELECT
    journal,
    COUNT(*) AS publication_count
FROM covid_research
WHERE journal IS NOT NULL AND journal != ""
GROUP BY journal
ORDER BY publication_count DESC
LIMIT 10;'

Result:

┌───────────────────────────────────┬───────────────────┐
│ journal                           │ publication_count │
├───────────────────────────────────┼───────────────────┤
│ "PLoS One"                        │ 9953              │
├───────────────────────────────────┼───────────────────┤
│ "bioRxiv"                         │ 8961              │
├───────────────────────────────────┼───────────────────┤
│ "Int J Environ Res Public Health" │ 8201              │
├───────────────────────────────────┼───────────────────┤
│ "BMJ"                             │ 6928              │
├───────────────────────────────────┼───────────────────┤
│ "Sci Rep"                         │ 5935              │
├───────────────────────────────────┼───────────────────┤
│ "Cureus"                          │ 4212              │
├───────────────────────────────────┼───────────────────┤
│ "Reactions Weekly"                │ 3891              │
├───────────────────────────────────┼───────────────────┤
│ "Front Psychol"                   │ 3541              │
├───────────────────────────────────┼───────────────────┤
│ "BMJ Open"                        │ 3515              │
├───────────────────────────────────┼───────────────────┤
│ "Front Immunol"                   │ 3442              │
└───────────────────────────────────┴───────────────────┘