Neural Data Server (NDS) is a search engine for A.I. data. It uses Machine Learning to find the most relevant data for pre-training deep neural models for your target application. Pre-training is known to significantly boost the performance of a deep neural network, particularly when labeled data for the target application is scarce. However, there are tens of millions of labeled examples available across hundreds of existing datasets. Using all of this data for pre-training is prohibitive: it would require massive computational resources, particularly while the neural architecture for your application is still being developed and many candidate architectures may need to be tested. Finding a smaller, manageable amount of relevant data for pre-training is the crucial problem that NDS addresses.
Neural Data Server indexes several popular Machine Learning datasets in order to serve optimal data to ML users. To perform a search, we provide a set of simple-to-use Machine Learning tools that probe your target application data (this step runs on the user's end) and send minimal information back to NDS so that it can decide which data examples are relevant. Note that the user's data is NOT visible to NDS, i.e., no user data is copied to NDS at any point. Finally, NDS generates a file with URL links to the relevant data samples. The user can then download the data (after agreeing to the licenses of the datasets) and is ready to train a powerful neural network for their application!
Watch the short video below on how Neural Data Server works. For further technical information about NDS please see our paper.
We will be continuously growing our Dataset Registry, so please check back for updates. If you own a Machine Learning dataset and wish to add it to our Dataset Registry, please Contact Us.
The principle behind NDS is simple: we provide the user with a few classifiers, each representing a cluster of indexed data in our Dataset Registry. These classifiers are evaluated on the user's dataset to measure their performance. This computation is performed on the user's end to ensure the privacy of their data. The performance scores are sent back to NDS, and NDS samples data according to them. Finally, NDS generates a file containing the IDs/URLs of the relevant data examples from the Dataset Registry. We do not host datasets; rather, we provide the IDs of samples hosted by the dataset providers. The user can specify the data budget (number of examples) they want. These IDs can then be used to fetch the most relevant data from the provider's website, subject to the respective dataset license agreement. Advice on how to use this data for pre-training is provided here.
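As a rough illustration of sampling "according to" classifier performance under a budget, the sketch below splits a budget across clusters in proportion to a softmax over per-cluster scores and then draws that many IDs from each cluster. This is only a minimal sketch under stated assumptions, not the NDS server code: the softmax weighting, the temperature parameter, and the names allocate_budget and select_ids are all hypothetical choices made for this example.

```python
import math
import random


def allocate_budget(expert_scores, budget, temperature=1.0):
    """Split `budget` across clusters in proportion to softmax(score / temperature).

    `expert_scores` maps a cluster name to the score its classifier reached on
    the user's data. The softmax weighting is an assumption of this sketch.
    """
    names = list(expert_scores)
    peak = max(expert_scores.values())  # subtract the max for numerical stability
    weights = [math.exp((expert_scores[n] - peak) / temperature) for n in names]
    total = sum(weights)
    shares = [budget * w / total for w in weights]
    counts = [int(s) for s in shares]

    # Largest-remainder rounding so the counts add up to exactly `budget`.
    leftover = budget - sum(counts)
    by_remainder = sorted(
        range(len(names)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return dict(zip(names, counts))


def select_ids(cluster_ids, allocation, seed=0):
    """Draw the allocated number of sample IDs from each cluster, without replacement."""
    rng = random.Random(seed)
    selected = []
    for name, count in allocation.items():
        pool = cluster_ids[name]
        selected.extend(rng.sample(pool, min(count, len(pool))))
    return selected


# Hypothetical scores for three clusters and a budget of 10 examples.
scores = {"cluster_a": 0.9, "cluster_b": 0.5, "cluster_c": 0.1}
allocation = allocate_budget(scores, budget=10)
pools = {name: [f"{name}/{i}" for i in range(20)] for name in scores}
chosen = select_ids(pools, allocation)
```

Clusters whose classifier does better on the user's data receive a larger share of the budget, while lower-scoring clusters still contribute a few samples. Lowering the hypothetical temperature would concentrate the budget on the best-scoring clusters.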
To use the NDS Search Engine: (1) First, run our script to adapt to your dataset and generate a transfer.pickle file for communicating with our server. (2) Next, upload the transfer.pickle file and click Download to obtain a JSON file containing the IDs of the most relevant data subset for your application. Please see our Guide for an instructional video on how to use our platform.
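Once the JSON file has been downloaded, it can be read with the Python standard library. The sketch below assumes, purely for illustration, that the file holds a flat top-level list of ID or URL strings; the real layout may differ, so inspect the file before relying on this. The file name nds_result.json and the function read_selected_ids are hypothetical.

```python
import json
from pathlib import Path


def read_selected_ids(path):
    """Load sample IDs from the downloaded JSON file.

    Assumes (hypothetically) a top-level JSON list of strings. Duplicates are
    dropped while keeping the original order.
    """
    with open(path, "r", encoding="utf-8") as handle:
        ids = json.load(handle)
    if not isinstance(ids, list):
        raise ValueError("expected a top-level JSON list of IDs")
    return list(dict.fromkeys(ids))


result_file = Path("nds_result.json")  # hypothetical name for the downloaded file
if result_file.exists():
    sample_ids = read_selected_ids(result_file)
    print(f"{len(sample_ids)} samples selected for pre-training")
```

The resulting list can then drive whatever download step each dataset provider requires, subject to that dataset's license.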
We list the datasets we currently support below. We currently focus on visual datasets with support for other modalities coming soon.
If you wish to adapt to your dataset via our Local Option (see 2), hover over the datasets you wish to search over and click Download Expert to transfer the pre-trained models used in the script to your local machine. Otherwise, you can jump to 2 and use Google Colab to generate the transfer.pickle file for communicating with our server.
* The licenses presented are those of each dataset. We make no warranties regarding the license status of the images in each dataset, and you should verify the license for each image yourself. Please contact each dataset’s original provider with any questions about licensing.
Contribute: Help grow the NDS Dataset Registry by registering your dataset. Please see instructions.
Upload the transfer.pickle file generated by the fast adapt script and download the most relevant subset. Please specify the number of data examples you want in the "Budget" box.
Force-directed graph of dataset clusters, showing transfer performance measured from fast adaptation. The centre node represents your target dataset; each of the other nodes represents an expert trained on a particular source dataset cluster. Closer nodes imply higher relevance to your dataset.
If you find NDS useful, please consider citing our paper.