Model Configuration Options

class finetune.config.Settings(**kwargs)[source]

Model configuration options. These settings are typically supplied as keyword arguments to a finetune model constructor; see the usage sketches after the parameter list.

Parameters:
  • batch_size – Number of examples per batch, defaults to 2.
  • visible_gpus – List of integer GPU ids to spread out computation across, defaults to all available GPUs.
  • n_epochs – Number of iterations through training data, defaults to 3.
  • random_seed – Random seed to use for repeatability purposes, defaults to 42.
  • max_length – Maximum number of subtokens per sequence. Examples longer than this number will be truncated (unless chunk_long_sequences=True for SequenceLabeler models). Defaults to 512.
  • weight_stddev – Standard deviation of initial weights. Defaults to 0.02.
  • chunk_long_sequences – When True, use a sliding-window approach to predict on examples that are longer than max_length. Defaults to False. See the sequence-labeling sketch after this list.
  • low_memory_mode – When True, only store partial gradients on forward pass and recompute remaining gradients incrementally in order to save memory. Defaults to False.
  • interpolate_pos_embed – Interpolate positional embeddings when max_length differs from its original value of 512. Defaults to False.
  • embed_p_drop – Embedding dropout probability. Defaults to 0.1.
  • attn_p_drop – Attention dropout probability. Defaults to 0.1.
  • resid_p_drop – Residual layer fully connected network dropout probability. Defaults to 0.1.
  • clf_p_drop – Classifier dropout probability. Defaults to 0.1.
  • l2_reg – L2 regularization coefficient. Defaults to 0.01.
  • vector_l2 – Whether to apply weight decay regularization to vectors (biases, normalization, etc.). Defaults to False.
  • optimizer – Optimizer to use; current options are AdamW and AdamaxW.
  • b1 – Adam b1 parameter. Defaults to 0.9.
  • b2 – Adam b2 parameter. Defaults to 0.999.
  • epsilon – Adam epsilon parameter. Defaults to 1e-8.
  • lr_schedule – Learning rate schedule – see finetune/optimizers.py for more options.
  • lr – Learning rate. Defaults to 6.25e-5.
  • lr_warmup – Learning rate warmup (percentage of all batches to warmup for). Defaults to 0.002.
  • max_grad_norm – Clip gradients larger than this norm. Defaults to 1.0.
  • accum_steps – Number of gradient updates to accumulate before applying them. This is used to simulate a larger batch size.
  • lm_loss_coef – Language modeling loss coefficient – a value between 0.0 - 1.0 that indicates how to trade off between language modeling loss and target model loss. Usually not beneficial to turn on unless dataset size exceeds a few thousand examples. Defaults to 0.0.
  • summarize_grads – Include gradient summary information in tensorboard. Defaults to False.
  • val_size – Validation set size if int. Validation set size as a percentage of all training data if float. Validation will not be run by default if n_examples < 50. If n_examples > 50, defaults to max(5, min(100, 0.05 * n_examples)).
  • val_interval – Evaluate on validation set after val_interval batches. Defaults to 4 * val_size / batch_size to ensure that too much time is not spent on validation.
  • lm_temp – Language model temperature – a value of 0.0 corresponds to greedy maximum likelihood predictions while a value of 1.0 corresponds to random predictions. Defaults to 0.2.
  • seq_num_heads – Number of attention heads of final attention layer. Defaults to 16.
  • subtoken_predictions – Whether to return predictions at subtoken granularity (True) or token granularity (False). Defaults to False.
  • multi_label_sequences – Use a multi-label approach to sequence labeling to allow overlapping labels.
  • multi_label_threshold – Threshold of the sigmoid unit in the multi-label classifier. Can be raised or lowered to trade off precision / recall. Defaults to 0.5.
  • autosave_path – Save current best model (as measured by validation loss) to this location. Defaults to None.
  • tensorboard_folder – Directory for tensorboard logs. Tensorboard logs will not be written unless tensorboard_folder is explicitly provided. Defaults to None.
  • log_device_placement – Log which device each operation is placed on for debugging purposes. Defaults to False.
  • allow_soft_placement – Allow tf to allocate an operation to a different device if a device is unavailable. Defaults to True.
  • save_adam_vars – Save adam parameters when calling model.save(). Defaults to True.
  • num_layers_trained – How many layers to finetune. Specifying a value less than 12 will train layers starting from model output. Defaults to 12.
  • train_embeddings – Should embedding layer be finetuned? Defaults to True.
  • class_weights – One of ‘log’, ‘linear’, or ‘sqrt’. Auto-scales gradient updates based on class frequency. Can also be a dictionary that maps from true class name to loss coefficient. Defaults to None.
  • oversample – Should rare classes be oversampled? Defaults to False.
  • params_device – Which device should gradient updates be aggregated on? If you are using a single GPU and have more than 4 GB of GPU memory, you should set this to the GPU's PCI number (0, 1, 2, etc.). Defaults to “cpu”.
  • eval_acc – If True, calculates accuracy and writes it to the tensorboard summary files for validation runs.
  • save_dtype – Specifies the precision with which to save model weights. Defaults to np.float32.
  • regression_loss – The loss to use for regression models. One of L1 or L2. Defaults to L2.
  • prefit_init – If True, fit target model weights before finetuning the entire model. Defaults to False.
  • debugging_logs – If True, output tensorflow logs and turn off TQDM logging. Defaults to False.
  • val_set – Where it is necessary to use an explicit validation set, provide it here as a tuple (text, labels). See the validation sketch after this list.
  • per_process_gpu_memory_fraction – Fraction of the overall GPU memory that each visible GPU should be allocated. Defaults to 1.0.
  • adapter_size – Width of the adapter module from the ‘Parameter Efficient Transfer Learning’ paper, if defined. Defaults to None.
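
These options are not normally set by instantiating Settings directly; they are passed as keyword arguments to a finetune model constructor and merged with the defaults documented above. A minimal usage sketch, assuming a Classifier model and illustrative toy data:

    from finetune import Classifier

    # Illustrative toy dataset; substitute your own texts and labels.
    train_texts = ["great product, would buy again", "arrived broken and late"]
    train_labels = ["positive", "negative"]

    # Keyword arguments override the defaults documented above
    # (batch_size=2, n_epochs=3, lr=6.25e-5, ...).
    model = Classifier(batch_size=4, n_epochs=2, lr=6.25e-5)

    model.fit(train_texts, train_labels)
    print(model.predict(["fast shipping and works as described"]))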
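
For long documents and imbalanced label sets, chunk_long_sequences can be combined with class_weights on a SequenceLabeler. A hedged sketch follows; the label name and the annotation format shown are illustrative assumptions, not taken verbatim from the library:

    from finetune import SequenceLabeler

    # Sliding-window prediction for documents longer than max_length, with
    # class weighting to compensate for rare labels.
    model = SequenceLabeler(
        max_length=512,
        chunk_long_sequences=True,
        class_weights="log",  # or a dict such as {"DATE": 4.0}
    )

    # Illustrative character-offset annotations; check the library's docs for
    # the exact format expected by SequenceLabeler.fit().
    texts = ["Payment of $100 is due on January 5th."]
    annotations = [[{"start": 26, "end": 37, "label": "DATE", "text": "January 5th"}]]

    model.fit(texts, annotations)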
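
Validation and monitoring options follow the same pattern. The sketch below uses an explicit val_set together with autosave_path and tensorboard_folder; the file paths and data are hypothetical placeholders:

    from finetune import Classifier

    train_texts = ["first training document", "second training document"]  # illustrative
    train_labels = ["a", "b"]
    val_texts = ["held-out document"]                                       # illustrative
    val_labels = ["a"]

    model = Classifier(
        val_set=(val_texts, val_labels),  # explicit validation data as a (text, labels) tuple
        val_interval=100,                 # evaluate every 100 batches
        eval_acc=True,                    # also write validation accuracy to tensorboard
        autosave_path="best_model.jl",    # hypothetical path for the best checkpoint
        tensorboard_folder="tb_logs/",    # hypothetical directory for tensorboard summaries
    )

    model.fit(train_texts, train_labels)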