Dataset Search and Augmentation

Auctus is an open-source dataset search engine designed to support data discovery and augmentation. Users (and systems) can pose a rich set of discovery queries: in addition to keyword-based search, they can specify spatial and temporal queries, as well as data augmentation queries (i.e., searches for datasets that can be concatenated to or joined with a query dataset). To support these queries, Auctus uses a data profiler that we developed to automatically extract useful information from datasets, including the data types of columns and summaries (or sketches) of their contents.
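
To make the idea of profiling for augmentation concrete, here is a minimal sketch in Python. It is not the Auctus profiler: the function names (infer_type, profile_csv, containment), the type labels, and the sample data are all hypothetical, and exact sets of distinct values stand in for the compact sketches a real system would use. It infers a coarse type for each column and then scores how well a column in a query dataset could be joined with a column in a candidate dataset.

```python
import csv
import io
from datetime import datetime


def _parses(values, parser):
    """Return True if every value is accepted by `parser`."""
    try:
        for value in values:
            parser(value)
    except ValueError:
        return False
    return True


def infer_type(values):
    """Guess a coarse type for a column of strings (labels are hypothetical)."""
    values = [v for v in values if v != ""]
    if not values:
        return "empty"
    if _parses(values, int):
        return "integer"
    if _parses(values, float):
        return "float"
    if _parses(values, lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "datetime"
    return "text"


def profile_csv(text):
    """Profile each column: inferred type plus the set of distinct values."""
    rows = list(csv.DictReader(io.StringIO(text)))
    names = list(rows[0].keys()) if rows else []
    profile = {}
    for name in names:
        column = [row[name] for row in rows]
        # An exact set is used here; a real profiler would keep a small sketch.
        profile[name] = {"type": infer_type(column), "values": set(column)}
    return profile


def containment(query_col, candidate_col):
    """Fraction of the query column's distinct values found in the candidate."""
    if not query_col["values"]:
        return 0.0
    shared = query_col["values"] & candidate_col["values"]
    return len(shared) / len(query_col["values"])


# Made-up example data: a query dataset and a candidate to join on "zip".
query = profile_csv("zip,date,count\n10001,2020-03-01,5\n10002,2020-03-02,7\n")
candidate = profile_csv("zip,population\n10001,21102\n10003,56024\n")

print(query["date"]["type"])                         # datetime
print(containment(query["zip"], candidate["zip"]))  # 0.5
```

A high containment score suggests that most rows of the query dataset would find a match in the candidate, which is one plausible signal for ranking join candidates; a production system would combine it with type compatibility and other evidence.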

Number of indexed datasets in the public Auctus instance: 20,255

Socrata: 18,015 (46 different domains including cityofnewyork.us, medicare.gov, sfgov.org, novascotia.ca)
Zenodo “covid”: 1,040 (datasets matching the query term “covid”)
Indicators from University of Arizona: 1,094
Indicators from World Bank: 20
Direct upload: 86

More Information:

The ISI Datamart project is building technology to create the largest publicly available knowledge graph to power data-driven models in a wide variety of domains. At the core of Datamart is Wikidata, a publicly available knowledge graph that already contains over 93 million entities. Datamart will enable communities of interest to build satellite knowledge graphs that contain detailed knowledge in their domains. The enabling technologies include an architecture for combining public data with private data kept within an organization's firewall, and tools to automate the ETL process required to syntactically and semantically align the data in millions of spreadsheets and CSV files with the Wikidata semantic representation.

The ISI Datamart project webpage is here.
The ISI Datamart dataset metadata schema and data schema are defined here.
The CSV files downloaded from Datamart are semantically aligned with Wikidata and use a canonicalized CSV format to facilitate joins. The ISI Datamart API is defined here. This Jupyter notebook demonstrates how to use the API.
The T2WML tool provides a graphical user interface for annotating tabular data so that it aligns with Wikidata.
ISI Datamart Fact Sheet

