What is a neural network?
It’s a technique for building a computer program that learns from data, based very loosely on how brains work. Software “neurons” are connected together, allowing them to send messages to each other. The network is asked to solve a problem, which it attempts to do over and over, each time strengthening the connections that lead to success and diminishing those that lead to failure. For a more detailed introduction to neural networks, Michael Nielsen’s Neural Networks and Deep Learning is a good place to start.
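To make the idea of strengthening and weakening connections concrete, here is a minimal sketch of a single software neuron learning a one-feature problem. It is an illustration only, not the code behind this page: the function names, the tanh activation, the learning rate and the use of plain gradient descent are all assumptions chosen for brevity.

```python
import math
import random

def train_neuron(points, labels, epochs=200, lr=0.1, seed=0):
    """Train one tanh neuron on a single feature with plain gradient descent.

    Hypothetical helper for illustration; the weight `w` plays the role of
    a connection strength that is nudged up or down after every example.
    """
    rng = random.Random(seed)
    w = rng.uniform(-0.5, 0.5)  # initial connection strength
    b = 0.0                     # bias term
    for _ in range(epochs):
        for x, y in zip(points, labels):
            out = math.tanh(w * x + b)
            # Gradient of 0.5 * (out - y)**2 with respect to (w * x + b)
            grad = (out - y) * (1.0 - out * out)
            w -= lr * grad * x
            b -= lr * grad
    return w, b

def predict(w, b, x):
    """Return the predicted class, +1 or -1."""
    return 1 if math.tanh(w * x + b) >= 0 else -1

# Made-up, linearly separable data: the class is the sign of the feature.
rng = random.Random(42)
xs = [rng.uniform(-2, 2) for _ in range(100)]
xs = [x for x in xs if abs(x) > 0.2]  # keep a small gap between the classes
ys = [1 if x > 0 else -1 for x in xs]

w, b = train_neuron(xs, ys)
accuracy = sum(predict(w, b, x) == y for x, y in zip(xs, ys)) / len(xs)
print(f"weight={w:.2f} bias={b:.2f} accuracy={accuracy:.2f}")
```

Each pass over the data moves the weight a little in whichever direction reduces the error, which is the "try over and over" loop described above in its simplest form.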
More about the datasets
The data points are scaled and dimensionless, so it might be hard to tell what they represent. Here is a brief description of each dataset.
Classification datasets
Linear – A dataset that is linearly separable along the X1 dimension. You'll only need one feature and one neuron to learn this relationship.
Diagonal – The two classes are clustered along a diagonal, so you'll need more than one feature to build an adequate classifier. Note that the training data does not span the domain of the data, whereas the validation data does — so the variance is always high on this dataset.
Exclusive Or (XOR) – A classic problem in neural network research, often cited as the simplest classification task that is not linearly separable.
Gaussian – A classification problem with clusters (or blobs) each represented by a Gaussian distribution. Change the noise level to broaden or narrow the distribution.
Circle – The instances of one class encircle the instances of the other.
Spiral – Two classes spiraling around each other.
Moons – Two interleaving moons, or croissants if you prefer. Inspired by the sklearn dataset.
Real data: sand vs shale – P-wave velocity (X1) and bulk density (X2) values for a selection of sandstones (cyan) and shales (dark blue) from the Rock Property Catalog. A standard scaler has been applied to the features, which were then multiplied by 2.
Real data: poro-perm – Porosity (X1) and the base-10 logarithm of permeability (X2) of Oligocene sandstones. Medium-to-coarse grained sands are in cyan, fine-to-medium grained sands are in dark blue. The dataset is modified from Taylor et al. 1993, dataset number 64 in the USGS report 03-420 of Porosity and Permeability from Core Plugs in Siliciclastic Rocks.
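As a rough illustration of what "not linearly separable" means for the XOR dataset in the list above, the sketch below generates an XOR-style point cloud and searches by brute force for the best straight-line boundary. Everything here is an assumption made for the example: the sampling range, the labelling rule (positive when X1 and X2 share a sign), the function names and the search grid. It is not how the datasets on this page are generated.

```python
import math
import random

def make_xor(n=400, seed=0):
    """Sample points in [-1, 1]^2; label +1 when x1 and x2 share a sign."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append((x1, x2, 1 if x1 * x2 > 0 else -1))
    return data

def accuracy(data, score):
    """Fraction of points whose label matches the sign of score(x1, x2)."""
    hits = sum((1 if score(x1, x2) > 0 else -1) == y for x1, x2, y in data)
    return hits / len(data)

def best_linear_accuracy(data, n_angles=36, n_offsets=21):
    """Brute-force search over boundaries of the form a*x1 + b*x2 + c = 0."""
    best = 0.0
    for i in range(n_angles):
        theta = 2 * math.pi * i / n_angles
        a, b = math.cos(theta), math.sin(theta)
        for j in range(n_offsets):
            c = -1.5 + 3.0 * j / (n_offsets - 1)
            acc = accuracy(data, lambda x1, x2: a * x1 + b * x2 + c)
            best = max(best, acc)
    return best

data = make_xor()
linear = best_linear_accuracy(data)
# A single nonlinear feature, the product x1 * x2, separates the classes.
product = accuracy(data, lambda x1, x2: x1 * x2)
print(f"best straight line: {linear:.2f}, product feature: {product:.2f}")
```

No straight line in the search gets close to a perfect score, while the product of the two inputs classifies every point correctly. A network has to build that kind of nonlinear combination itself, either from extra input features or from hidden neurons.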
Regression datasets
Plane – Data coordinates sampling a dipping planar surface.
Multi-Gaussian – Multiple clusters with Gaussian spatial distributions.
Real data: Porosity – This is a small dataset representing a porosity map. It is from Geoff Bohling at the Kansas Geological Survey, but we can no longer find the data online.
Real data: DTS from DTP and RHOB – Predicting S-wave sonic wireline measurements (DTS) from compressional velocity (DTP) (X1) and bulk density (RHOB) (X2). The data has been downsampled from well R-39 offshore Nova Scotia, available from the CNSOPB.
How do I make my own datasets?
You will need to define a machine learning task with two features and one target. About 400 records is ideal; for a classification problem, split them evenly between the positive and negative classes. The features will be scaled using Z-score standardization (then multiplied by 2 for complicated reasons). The target should be -1 or 1 for classification, or in the range [-1, 1] for regression. Here's an example of how to prepare data.
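If you want to see what that scaling looks like in code, the following is a minimal sketch using only the Python standard library. The input values are made up, and the JSON record layout (the keys `x`, `y` and `label`) is a hypothetical placeholder, not the schema this page expects, so check the linked example for the real field names before uploading anything.

```python
import json
import statistics

def scale_feature(values, factor=2.0):
    """Z-score standardize a feature, then multiply it by `factor`.

    Uses the population standard deviation, which is an assumption here.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("feature has zero variance and cannot be scaled")
    return [factor * (v - mean) / std for v in values]

def prepare_classification(x1, x2, classes, positive):
    """Scale both features and map class names to targets of 1 and -1."""
    x1_scaled, x2_scaled = scale_feature(x1), scale_feature(x2)
    labels = [1 if c == positive else -1 for c in classes]
    # Hypothetical record layout; the real upload format may differ.
    return [
        {"x": a, "y": b, "label": t}
        for a, b, t in zip(x1_scaled, x2_scaled, labels)
    ]

# Made-up values, far fewer than the ~400 records recommended above.
raw_x1 = [2500.0, 2700.0, 3100.0, 3300.0]
raw_x2 = [2.20, 2.35, 2.45, 2.60]
raw_classes = ["shale", "shale", "sand", "sand"]

records = prepare_classification(raw_x1, raw_x2, raw_classes, positive="sand")
payload = json.dumps(records, indent=2)
print(payload)
```

For a regression target, the class mapping would be replaced by a rescaling of the target into the range [-1, 1]; the feature scaling stays the same.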
Once you have a JSON file, you can upload it using the upload button (top left).