High-quality data is the fuel that drives AI, and the machine learning community cannot get enough of it. Organisations and individuals working on AI need datasets to power their machine learning, deep learning and NLP algorithms.
2018 saw an unprecedented number of newly open-sourced datasets, including Stanford University’s Q&A dataset HotpotQA, UC Berkeley’s large-scale self-driving dataset BDD100K and Google’s Open Images V4.
To keep you abreast of the latest trends in open data, here is our pick of the free public data sources for 2019:
Facial Diversity Stakes Rise with IBM
A 2018 study found that IBM’s facial recognition system identified dark-skinned female subjects with only 65.3% accuracy. Since then, the tech giant has stepped up its research efforts against AI bias by releasing Diversity in Faces (DiF), billed as the world’s largest facial attribute dataset.
DiF comprises one million human facial images compiled from the YFCC-100M Creative Commons dataset. IBM says its aim is to reduce bias and advance impartiality in facial recognition technology.
CheXpert Dataset Announced by Landing.ai Founder Andrew Ng
In January 2019, Stanford University researchers led by Landing.ai founder Andrew Ng announced CheXpert, a large dataset of chest X-rays designed for automated interpretation. The Stanford Machine Learning Group believes that deep learning could automatically detect chest abnormalities at the level of human experts. The CheXpert dataset contains 224,316 chest radiographs from 65,240 patients. The data was collected from chest radiographic examinations performed between 2002 and 2017 at Stanford Hospital’s inpatient and outpatient centers, along with the associated radiology reports. Andrew Ng’s research group developed an automatic labeler that turns the observations in those reports into structured labels, marking each one as positive, negative, or uncertain.
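As a rough illustration of the three-way labeling described above, the sketch below models the positive, negative, and uncertain outcomes and shows one way a downstream user might collapse uncertain labels into a binary target before training. The `Label` enum, the `to_binary` helper, and the finding names are hypothetical; they are not taken from the CheXpert release or its labeler.

```python
from enum import Enum


class Label(Enum):
    """The three structured outcomes (names are illustrative)."""
    POSITIVE = "positive"
    NEGATIVE = "negative"
    UNCERTAIN = "uncertain"


def to_binary(label: Label, uncertain_as: Label = Label.NEGATIVE) -> int:
    """Collapse a three-way label into 0 or 1.

    `uncertain_as` decides how uncertain findings are treated. This is a
    modelling choice made by the user, not something the dataset dictates.
    """
    if uncertain_as is Label.UNCERTAIN:
        raise ValueError("uncertain_as must be POSITIVE or NEGATIVE")
    resolved = uncertain_as if label is Label.UNCERTAIN else label
    return 1 if resolved is Label.POSITIVE else 0


# Hypothetical labels for a single radiograph.
study = {
    "finding_a": Label.POSITIVE,
    "finding_b": Label.UNCERTAIN,
    "finding_c": Label.NEGATIVE,
}

# Two alternative policies for the uncertain finding.
as_negative = {name: to_binary(lab) for name, lab in study.items()}
as_positive = {name: to_binary(lab, Label.POSITIVE) for name, lab in study.items()}
```

Keeping the uncertain outcome as its own value until the last step makes it easy to compare both policies on the same data.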
Natural Questions Dataset Introduced by Google Research
Google Research announced the Natural Questions dataset to drive NLP research by providing end-to-end training data for question-answering problems. The dataset consists of over 300,000 question-answer pairs collected from real, anonymized and aggregated queries issued to the Google search engine.
Facebook’s BISON Dataset
Facebook’s latest dataset, BISON, focuses on fine-grained visual grounding. Given a caption and a pair of images, the system must choose the image that best matches the caption. The dataset was compiled to complement the COCO Captions dataset. BISON-COCO is not a training dataset but an evaluation dataset, used to check how well existing models pair visual content with the appropriate text descriptions.
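The evaluation idea can be sketched in a few lines: for each caption, score both candidate images and count how often the higher-scoring one is the correct one. This is a minimal sketch under stated assumptions, not Facebook’s evaluation code. The `overlap_score` function is a toy stand-in for a real image-text model, and each image is represented here by a hypothetical set of tags rather than by pixels.

```python
from typing import Callable, Sequence, Set, Tuple

# (caption, first image, second image, index of the correct image: 0 or 1)
Example = Tuple[str, Set[str], Set[str], int]


def overlap_score(caption: str, image_tags: Set[str]) -> int:
    """Toy scorer: count caption words that also appear among the image tags."""
    return len(set(caption.lower().split()) & image_tags)


def selection_accuracy(
    examples: Sequence[Example],
    score: Callable[[str, Set[str]], float],
) -> float:
    """Fraction of examples where the higher-scoring image is the correct one."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for caption, first, second, target in examples:
        # Ties go to the first image; a real evaluation may treat ties differently.
        predicted = 0 if score(caption, first) >= score(caption, second) else 1
        correct += int(predicted == target)
    return correct / len(examples)


# Hypothetical examples: the two images in each pair differ only in small details.
examples = [
    ("a dog on a red couch", {"dog", "couch", "red"}, {"dog", "couch", "blue"}, 0),
    ("two cats near a window", {"cat", "door"}, {"cats", "window", "two"}, 1),
]
accuracy = selection_accuracy(examples, overlap_score)
```

Because the two candidates are deliberately similar, a model has to get the details right to score well, which is the point of an evaluation-only benchmark of this kind.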
Stanford Dataset Focuses Question Answering on Real World Images
In 2018, Stanford NLP invested heavily in open data for question answering with the HotpotQA, CoQA and SQuAD 2.0 datasets. More recently, the Stanford NLP Group led by Christopher Manning released GQA, a dataset for visual reasoning and compositional question answering based on real-world images. Visual question answering is an important AI subfield that involves building models that answer natural-language questions about visual content. Existing public datasets include VQA, which contains open-ended questions about images. GQA contains 20 million questions paired with images, and each image is associated with a scene graph describing its objects, attributes and relations.
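To make the idea of a scene graph concrete, here is a small sketch of one possible in-memory representation: objects carry attributes, and relations link one object to another. The class names, field names, and the `related` helper are hypothetical and do not reflect the actual GQA file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneObject:
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    objects: Dict[str, SceneObject]        # object id -> object
    relations: List[Tuple[str, str, str]]  # (subject id, relation, object id)

    def related(self, relation: str, target_name: str) -> List[SceneObject]:
        """Objects that stand in `relation` to any object named `target_name`."""
        return [
            self.objects[subj]
            for subj, rel, obj in self.relations
            if rel == relation and self.objects[obj].name == target_name
        ]


# Hypothetical graph for an image of a white cup on a wooden table.
graph = SceneGraph(
    objects={
        "o1": SceneObject("cup", ["white"]),
        "o2": SceneObject("table", ["wooden"]),
    },
    relations=[("o1", "on", "o2")],
)

# "What color is the thing on the table?" resolved in two steps:
# first follow the "on" relation, then read the attribute.
answer = graph.related("on", "table")[0].attributes[0]
```

A compositional question like this one chains a relation lookup with an attribute lookup, which is the kind of multi-step reasoning such a dataset is meant to exercise.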
Google’s Free Open Data Sets
Plenty of other free open datasets have been doing the rounds, but until early September 2018, Google search had no way to search dataset metadata. Schema.org is an open vocabulary that Google co-founded rather than acquired, and now that publishers can describe their datasets with Schema.org markup, Google recognizes that metadata and makes the datasets searchable.
Google staff say they have already indexed more than a million items that appear to be datasets, with some refinements already available. Here’s a subsidiary search site just for truly public datasets: https://cloud.google.com/public-datasets/
This page will also lead you to some special subsets such as the Google BigQuery Public Datasets (the first terabyte of queries each month is free, but charges apply after that), Geo-Imagery Datasets and Google Genomics Public Datasets.
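For publishers who want their own data to be picked up by dataset search, the metadata in question is typically a Schema.org `Dataset` description embedded in the dataset’s web page as JSON-LD. The sketch below builds such a description in Python. The `@context`, `@type` and property names come from the Schema.org vocabulary, while the dataset name, URL and organisation are made-up placeholders.

```python
import json

# Schema.org "Dataset" description. All values below are hypothetical.
dataset_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example City Air Quality Readings",
    "description": "Hourly air quality measurements from example sensors.",
    "url": "https://example.org/datasets/air-quality",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Org"},
}

# Serialise to JSON-LD and wrap it in the script tag that goes into the page.
json_ld = json.dumps(dataset_metadata, indent=2)
script_tag = f'<script type="application/ld+json">\n{json_ld}\n</script>'
```

Which properties a given search engine requires or recommends is best checked against its own documentation; the fields shown here are only a plausible starting set.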
Not to be outdone, Microsoft recently launched a similar site called Microsoft Research Open Data, currently in beta.
MS Research Open Data does not search the entire web. Instead, it makes available 53 proprietary datasets, all in the realm of deep learning, in both text/speech and image formats.
Academic Torrents is a distributed system for sharing very large datasets on an eclectic range of topics. It offers just under 2,000 datasets totalling about 28 terabytes. The datasets are searchable, though the search is perhaps not as comprehensive as Google’s. Besides downloading, you can also upload your own dataset to the site for community use.
Skymind is a commercial platform for rapidly prototyping, deploying, maintaining and retraining machine learning models. Skymind offers 101 datasets from a variety of sources, covering text, question answering, video, sentiment, speech, symbolic music, health and biology, recommendation, natural images, geospatial, facial, networks and graphs, and government and statistical data.
Kaggle / Federal Government Sources and GitHub
These are the tried and true traditional free data sources:
• GitHub: 565 datasets.
• Kaggle Public Datasets: 10,992 current listings.
• Data.gov: The home of the US Government’s open data, currently housing 302,944 datasets.
We hope that these open data sources will be useful to you in 2019.