`_. The folder also contains an example input file ``input.dat``, a training data set ``train.dat``, and an example submission script ``job_NN.slurm`` for the `slurm queueing system `_.

Hard-coded variables
"""""""""""""""""""""

The hard-coded variables can be found in the header file ``nn-class.h``. The hard-coded variables include:

:MAX_NODES: largest value of a hidden layer dimension + 1 (default is 51)

Furthermore, the following variable should be adjusted as needed:

:goodrange[2]{1,1}: scaling range for input and output values; this is an absolute value; the absolute value of the range has to be the same for input and output

There is currently one activation function implemented, 1/(1+exp(x)); again, this is hard-coded, and other activation functions can be added as needed.

Input and output files
^^^^^^^^^^^^^^^^^^^^^^^^

Main input file
"""""""""""""""""

The main input file contains the basic set-up; an example file ``input.dat`` is provided together with the code and is shown below:

.. code:: console

    ------------------------------------------------------------------------------------
    Input File for neural network fit
    ------------------------------------------------------------------------------------
    # initial seed for random number generator (>0)
    # if seed <= 0 a random seed between 1 and 1000 is generated
    ran_seed        123

    # network architecture
    hidden layers   2                 ! number of hidden layers
    input           14                ! number of input features
    hidden nodes    25 25             ! number of nodes in each hidden layer (one number for each layer)
    output          5                 ! number of outputs

    # parallel threads
    nthreads        8

    # simulation setup
    run option      0                 ! 1=train with new weights, -1=train with old weight; 0=test job
    iterations      250               ! maximum number of training iterations
    optimizer max   200               ! maximum number of optimizer steps
    save frequency  1                 ! how often the results are saved
    train input     train.dat         ! name of file containing training data
    test input      test_A15.dat      ! name of file containing test data
    test output     output.dat        ! name of file for test output
    test error      error.dat         ! name of file for test error result (misclassified)
    weights input   weights_old.dat   ! name of file containing previous weights
    weights output  weights.dat       ! name of file to print new fitted weights
    cost output     cost.dat          ! name of the file to save values of cost function file fitting

    !-----------------------------------------------------------------------------------
    ! FORMAT OF THE INPUT FILE:
    ! the first 3 lines are simply read in as string
    ! the rest is parsed and identified over a string
    ! !IMPORTANT! The first 16 (!) characters are reserved for the key word!!
    ! after this different parameter in one line must be separated by at least one space
    ! lines starting with '!' or '#' are comments

The first 3 lines are simply read in as strings and ignored. The first 16(!) characters of each line are reserved for the keyword and should not be changed; the values given for each keyword are free format; if several values are required, they need to be separated by spaces. The order of the input lines can be changed.
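
The fixed-width keyword convention described above can be illustrated with a short Python sketch. This is not the parser used by the code itself; the function name ``parse_input`` is hypothetical, and treating everything after a ``!`` on a value line as an inline comment is an assumption based on the example file.

.. code:: python

    def parse_input(text):
        """Parse the main input file format into a dict (illustrative sketch only)."""
        params = {}
        # The first 3 lines are a header and are ignored.
        for line in text.splitlines()[3:]:
            # Skip blank lines and comment lines starting with '!' or '#'.
            if not line.strip() or line.lstrip().startswith(("!", "#")):
                continue
            # The first 16 characters are reserved for the keyword.
            keyword = line[:16].strip()
            # Assumption: a '!' after the values starts an inline comment.
            values = line[16:].split("!", 1)[0].split()
            params[keyword] = values
        return params


    example = "\n".join([
        "-" * 20,
        "Input File for neural network fit",
        "-" * 20,
        "# network architecture",
        "hidden layers   2          ! number of hidden layers",
        "hidden nodes    25 25      ! nodes per hidden layer",
        "train input     train.dat  ! training data",
    ])
    params = parse_input(example)
    print(params)

Values are kept as lists of strings here, so a caller would still need to convert them to integers or file names depending on the keyword.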

:cost output: name of the output file containing the values of the cost function during the fit (default: ``cost.dat``)
:hidden layers: number of hidden layers in the NN; this needs to be specified before the other parameters of the network architecture (default: -1)
:hidden nodes: number of nodes in each hidden layer; provide one number for each layer, separated by spaces
:input: number of input features (default: -1)
:iterations: number of optimisation steps; each step goes over all training points (default: 1)
:nthreads: number of OMP parallel threads (default: 1)
:optimizer max: maximum number of optimizer/gradient steps in each iteration (default: 1)
:output: number of output nodes (default: -1)
:ran_seed: seed to initialize the random number generator; random numbers are used to initialize the weights of the NN in fitting; if a value smaller than 0 is chosen, a random seed between 1 and 1000 is set by the programme (default: 42)
:run option: determines the simulation task: 0 - read in weights file and evaluate test data; 1 - train NN with randomly initialized weights; -1 - train NN with weights read in from weights file (default: 0)
:save frequency: how often the fitted weights are saved to an output file (default: 1)
:test error: name of the file containing the misclassified test data (default: ``error.dat``)
:test input: name of the input file containing the test data
:test output: name of the file containing the classification results of the test data (default: ``output.dat``)
:train input: name of the input file containing the training data
:weights input: name of the input file containing the weights (default: ``weights_old.dat``)
:weights output: name of the output file containing the fitted weights (default: ``weights.dat``)

Training and test data
""""""""""""""""""""""""""

The format of the file containing the training or test data is given by the number of input and output nodes.
In general, there is one line per data point, containing the values of the input functions followed by the corresponding output values. For classification, the output values are either 0 or 1. The number of columns is thus equal to the number of input nodes plus the number of output nodes: e.g., if there are *x* input functions and *y* output values, then the first *x* columns are the values of these input functions for a given configuration, and the next *y* columns are the values of the corresponding output vector. In the example provided, the training and test data have 14 input functions and 5 output values as detailed in this `publication `_.

Weights file
"""""""""""""

The weights file contains the optimized weights and information concerning the architecture of the NN:

- line 1: number of hidden layers
- line 2: NN architecture; number of nodes in input, hidden, and output layers
- line 3: translation and scaling for the input and output values
- line 4: number of fitting iterations
- line 5 and beyond: weights

Running the code
^^^^^^^^^^^^^^^^^^^^

After the code has been compiled, it is executed by calling

.. code:: console

    ./ml-classification.exe input.dat > screen.out

where ``input.dat`` is the name of the main input file. The code will run in parallel with the number of threads defined in the `input file `_. In the command line above, the screen output is redirected to ``screen.out``.

.. In addition, the code writes two files, the ``weight.txt`` file and a file where the name is composed of the name of the input training data and the size of the hidden layers. For example, if the input file is *Training.txt* and the NN has 2 hidden layers with 25 nodes each, the file is called ``Training_25_25_cost.txt``.

.. :Training_25_25_cost.txt: contains the cost function for each iteration; the meaning of the magnitude of the cost function is not entirely clear here, but it does need to converge to a constant value; this might not always be obvious to see

.. :screen.out: information about the architecture of the NN and running parameters, and again the convergence of the cost function

If the training is successful and the weights are converged, the file ``weights.dat`` (or whichever name was set for this file in the input) can be used to apply the NN for the classification of unknown environments.

Output files
"""""""""""""

In **training mode** (run option 1 or -1), the code writes a weights file (default ``weights.dat``, see `Weights file`_) and a cost function file (default ``cost.dat``), and prints some additional information to the screen. The cost function file contains the training loss and can be used to monitor the convergence of the fit.

In **testing mode** (run option 0), the code writes an output file and an error file containing the predicted and true values of the test data. The output of the NN is a continuous value :math:`x \in [0,1]` for each component (class) of the output layer. For the test data, the true value is strictly either 0 or 1. To compare the predicted values with the true values in the test file, the code checks whether all components that should be 0 are :math:`< 0.5` AND whether the component that should be 1 is :math:`> 0.5`. If this is not the case, the data point is considered wrongly predicted by the NN. The L1 accuracy is defined here as 1-(wrong_points/total_points).

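
The comparison rule and the accuracy definition above can be written out as a short Python sketch. It only illustrates the criterion as described in the text and is not taken from the code itself; the function names ``is_correct`` and ``accuracy`` and the sample numbers are made up for this example.

.. code:: python

    def is_correct(predicted, true, threshold=0.5):
        """True if every component lies on the correct side of the threshold."""
        for p, t in zip(predicted, true):
            if t == 1 and not p > threshold:
                return False  # the component that should be 1 is not > 0.5
            if t == 0 and not p < threshold:
                return False  # a component that should be 0 is not < 0.5
        return True


    def accuracy(predictions, targets):
        """Accuracy defined as 1 - (wrong_points / total_points)."""
        wrong = sum(1 for p, t in zip(predictions, targets) if not is_correct(p, t))
        return 1 - wrong / len(targets)


    # Hypothetical data: 4 test points with 3 output classes each.
    predictions = [
        [0.9, 0.1, 0.0],  # correct
        [0.4, 0.3, 0.2],  # wrong: the true class stays below 0.5
        [0.2, 0.7, 0.6],  # wrong: a component that should be 0 exceeds 0.5
        [0.1, 0.2, 0.8],  # correct
    ]
    targets = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
    result = accuracy(predictions, targets)
    print(result)

Note that under this rule a point counts as wrong even when the largest output belongs to the true class, as long as any other component is not below 0.5.
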

:screen.out: standard output; contains information about the architecture of the NN and the running parameters, as well as the L1 accuracy of the predicted classification with respect to the test set
:output.dat: for all test data points, the first *x* columns are the values of the *x* input functions, the next *y* columns are the predicted output values of the NN with *y* output nodes, and the last *y* columns are the actual values from the test set
:error.dat: same format as ``output.dat``, but only the points for which the NN failed to predict the correct class, as defined above, are recorded

Test examples
^^^^^^^^^^^^^^^^^^^^

To test the quality of a fitted NN for classification, the predicted values can be compared to known values for a given set of data points (e.g. different local structural environments). In the folder ``testdata``, there are several input files with test data for various structural environments: ``test_BCC.txt``, ``test_FCC.txt``, ``test_HCP.txt``, ``test_A15.txt``, ``test_LIQ.txt``, and ``test_INT.txt``, corresponding to a body-centred cubic (BCC), face-centred cubic (FCC), hexagonal close-packed (HCP), A15, liquid (LIQ), and BCC/A15 interface (INT) structure (see also this `publication `_). To run a test on any of these test data sets, provide a weights file, set the run option to 0, and set the name of the test input to any of the provided files.

.. warning::

    Make sure that the architecture (input, output, hidden layers) used with the given weights file is the same as used for training the neural network!

Compilation
^^^^^^^^^^^^^^^^^^^^

The NN fitting and testing code can be used with either `Intel MKL `_ or `OpenBLAS `_, one of which needs to be installed on the system the code is supposed to run on. (New York University cluster specific setup is explained :ref:`here `)

Linux machines
"""""""""""""""

Using Intel MKL
++++++++++++++++++

Make sure `Intel MKL `_ is installed and environment variables (such as ``MKLROOT``) are set correctly.
Then the provided ``Makefile`` can be used and the code is compiled by

.. code:: console

    make

which will produce an executable ``ml-classification.exe``. This executable will run in parallel with the number of threads given in the input file.

Using OpenBLAS
+++++++++++++++++++++

Make sure `OpenBLAS `_ is installed and environment variables (such as ``BLAS_LIB``) are set correctly. Then the provided ``Makefile`` can be used and the code is compiled by

.. code:: console

    make OBLAS=1

which will produce an executable ``ml-classification.exe``. This executable will run in parallel with the number of threads given in the input file.

Mac
""""

Using OpenBLAS
+++++++++++++++++++++

If `OpenBLAS `_ is not installed, this can, for example, be done using `MacPorts `_. Usually, the corresponding ``INCLUDE`` and ``LDLIBS`` path should then be

.. code:: console

    /opt/local/include
    /opt/local/lib

If this is not the case, the ``Makefile`` needs to be adapted correspondingly. The code can then be compiled using

.. code:: console

    make MAC_OBLAS=1

which will produce an executable ``Machine_Learning.exe``. This executable will run in parallel with the number of threads given in the input file.