Model Configuration Options

class finetune.config.Settings(**kwargs)

Model configuration options.
Parameters:
 base_model – Which base model to use, one of {GPT, GPT2, RoBERTa, BERT, TextCNN, TCN}, imported from finetune.base_models. Defaults to GPT.
 batch_size – Number of examples per batch, defaults to 2.
 visible_gpus – List of integer GPU ids to spread out computation across, defaults to all available GPUs.
 n_epochs – Number of iterations through training data, defaults to 3.
 seed – Random seed to use for repeatability purposes, defaults to 42.
 max_length – Maximum number of subtokens per sequence. Examples longer than this number will be truncated (unless chunk_long_sequences=True for SequenceLabeler models). Defaults to 512.
 weight_stddev – Standard deviation of initial weights. Defaults to 0.02.
 chunk_long_sequences – When True, use a sliding window approach to predict on examples that are longer than max_length. The progress bar will display the number of chunks processed rather than the number of examples. Defaults to True.
 low_memory_mode – When True, only store partial gradients on forward pass and recompute remaining gradients incrementally in order to save memory. Defaults to False.
 interpolate_pos_embed – Interpolate positional embeddings when max_length differs from its original value of 512. Defaults to False.
 embed_p_drop – Embedding dropout probability. Defaults to 0.1.
 attn_p_drop – Attention dropout probability. Defaults to 0.1.
 resid_p_drop – Residual layer fully connected network dropout probability. Defaults to 0.1.
 clf_p_drop – Classifier dropout probability. Defaults to 0.1.
 l2_reg – L2 regularization coefficient. Defaults to 0.01.
 vector_l2 – Whether to apply weight decay regularization to vectors (biases, normalization parameters, etc.). Defaults to False.
 optimizer – Optimizer to use, current options include AdamW or AdamaxW.
 b1 – Adam b1 parameter. Defaults to 0.9.
 b2 – Adam b2 parameter. Defaults to 0.999.
 epsilon – Adam epsilon parameter. Defaults to 1e-8.
 lr_schedule – Learning rate schedule – see finetune/optimizers.py for more options.
 lr – Learning rate. Defaults to 6.25e-5.
 lr_warmup – Learning rate warmup (percentage of all batches to warmup for). Defaults to 0.002.
 max_grad_norm – Clip gradients larger than this norm. Defaults to 1.0.
 shuffle_buffer_size – How many examples to load into a buffer before shuffling. Defaults to 100.
 dataset_size – Must be specified in order to calculate the learning rate schedule when the inputs provided are generators rather than static datasets.
 accum_steps – Number of updates to accumulate before applying. This is used to simulate a higher batch size.
 lm_loss_coef – Language modeling loss coefficient – a value between 0.0 and 1.0 that indicates how to trade off between language modeling loss and target model loss. Usually not beneficial to turn on unless dataset size exceeds a few thousand examples. Defaults to 0.0.
 tsa_schedule – Training Signal Annealing Schedule from ‘Unsupervised Data Augmentation for Consistency Training’. One of {“linear_schedule”, “exp_schedule”, “log_schedule”}. Defaults to None.
 summarize_grads – Include gradient summary information in tensorboard. Defaults to False.
 val_size – Validation set size if int; validation set size as a percentage of all training data if float. Defaults to 0. If the value “auto” is provided, validation is not run when n_examples < 50; otherwise val_size defaults to max(5, min(100, 0.05 * n_examples)).
 val_interval – Evaluate on validation set after val_interval batches. Defaults to 4 * val_size / batch_size to ensure that too much time is not spent on validation.
 lm_temp – Language model temperature – a value of 0.0 corresponds to greedy maximum likelihood predictions while a value of 1.0 corresponds to random predictions. Defaults to 0.2.
 seq_num_heads – Number of attention heads of final attention layer. Defaults to 16.
 keep_best_model – Whether or not to keep the highest-performing model weights throughout training. Defaults to False.
 early_stopping_steps – How many steps to continue with no loss improvement before early stopping. Defaults to None.
 subtoken_predictions – Return predictions at subtoken granularity or token granularity? Defaults to False.
 multi_label_sequences – Use a multi-labeling approach to sequence labeling to allow overlapping labels.
 multi_label_threshold – Threshold of the sigmoid unit in the multi-label classifier. Can be raised or lowered to trade off precision and recall. Defaults to 0.5.
 autosave_path – Save current best model (as measured by validation loss) to this location. Defaults to None.
 tensorboard_folder – Directory for tensorboard logs. Tensorboard logs will not be written unless tensorboard_folder is explicitly provided. Defaults to None.
 log_device_placement – Log which device each operation is placed on for debugging purposes. Defaults to False.
 allow_soft_placement – Allow tf to allocate an operation to a different device if a device is unavailable. Defaults to True.
 save_adam_vars – Save Adam optimizer parameters when calling model.save(). Defaults to True.
 num_layers_trained – How many layers to finetune. Specifying a value less than the model’s number of layers will train layers starting from the model output. Defaults to 12.
 train_embeddings – Should embedding layer be finetuned? Defaults to True.
 class_weights – One of ‘log’, ‘linear’, or ‘sqrt’. Autoscales gradient updates based on class frequency. Can also be a dictionary that maps from true class name to loss coefficient. Defaults to None.
 oversample – Should rare classes be oversampled? Defaults to False.
 params_device – Which device should gradient updates be aggregated on? If you are using a single GPU and have more than 4 GB of GPU memory, you should set this to the GPU’s PCI number (0, 1, 2, etc.). Defaults to “cpu”.
 eval_acc – if True, calculates accuracy and writes it to the tensorboard summary files for validation runs.
 save_dtype – specifies what precision to save model weights with. Defaults to np.float32.
 regression_loss – the loss to use for regression models. One of L1 or L2, defaults to L2.
 prefit_init – if True, fit target model weights before finetuning the entire model. Defaults to False.
 debugging_logs – if True, output tensorflow logs and turn off TQDM logging. Defaults to False.
 val_set – Where it is necessary to use an explicit validation set, provide it here as a tuple (text, labels).
 per_process_gpu_memory_fraction – fraction of the overall amount of memory that each visible GPU should be allocated, defaults to 1.0.
 adapter_size – Width of the adapter module from the ‘Parameter-Efficient Transfer Learning’ paper, if defined. Defaults to None.
 n_context_embed – Dimensionality of auxiliary info embeddings. Only use if passing ‘default_context’ to the model as well. Defaults to 6 for convolutional models, otherwise 32.
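As a minimal sketch of how options like these are typically consumed (an illustration of the keyword-override pattern, not finetune's actual implementation — the defaults dict below lists only a few of the parameters above), provided values replace the defaults and unknown keys are rejected:

```python
# Illustrative sketch of Settings(**kwargs)-style configuration:
# keyword arguments override documented defaults; typos raise early.
DEFAULTS = {
    "base_model": "GPT",
    "batch_size": 2,
    "n_epochs": 3,
    "max_length": 512,
    "lr": 6.25e-5,
}

def make_settings(**overrides):
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown settings: {sorted(unknown)}")
    settings = dict(DEFAULTS)
    settings.update(overrides)
    return settings

config = make_settings(batch_size=4, n_epochs=2)
```

Rejecting unknown keys up front means a misspelled option fails loudly instead of silently falling back to its default.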
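The “auto” behaviour documented for val_size can be expressed as follows (a sketch of the documented rule only, not the library's source):

```python
# Sketch of val_size="auto" as documented: no validation below 50
# examples; otherwise 5% of the data, clamped to the range [5, 100].
def auto_val_size(n_examples):
    if n_examples < 50:
        return 0  # validation is not run
    return max(5, min(100, int(0.05 * n_examples)))
```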