birdnet_tiny_forge.pipelines.data_preprocessing package

Submodules

birdnet_tiny_forge.pipelines.data_preprocessing.nodes module

Nodes to be used in data preprocessing pipeline

birdnet_tiny_forge.pipelines.data_preprocessing.nodes.decide_splits(df, test_size=0.2, val_size=0.1, random_state=42)

Given the dataset metadata, populate the split information, assigning each entry to the train, validation, or test split.

Parameters:
  • df – metadata dataframe

  • test_size – fraction of dataset used for testing

  • val_size – fraction of dataset used for validation

  • random_state – seed for the random assignment, so that the splits are reproducible

Returns:

metadata dataframe containing split info
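The split assignment can be illustrated with a minimal sketch. It is not the package's implementation: the column name "split", the labels "train"/"val"/"test", and the plain (non-stratified) shuffle are all assumptions made for illustration, and the real node may, for example, stratify by label.

```python
import numpy as np
import pandas as pd


def decide_splits_sketch(df, test_size=0.2, val_size=0.1, random_state=42):
    """Assign every row to "train", "val" or "test" (illustrative only)."""
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(df))  # shuffled row positions
    n_test = int(round(len(df) * test_size))
    n_val = int(round(len(df) * val_size))

    # Everything defaults to "train"; carve test and val out of the shuffle.
    split = np.full(len(df), "train", dtype=object)
    split[order[:n_test]] = "test"
    split[order[n_test:n_test + n_val]] = "val"

    out = df.copy()  # leave the caller's dataframe untouched
    out["split"] = split  # "split" is an assumed column name
    return out


metadata = pd.DataFrame({"path": [f"clip_{i}.wav" for i in range(10)]})
result = decide_splits_sketch(metadata)
```

Fixing random_state makes the assignment repeatable, which matters here because the splits are decided early and reused by later pipelines.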

birdnet_tiny_forge.pipelines.data_preprocessing.nodes.extract_loudest_slice(audio_clips, audio_slice_duration_ms)

This node uses Kedro’s lazy loading idiom of passing a dictionary of callables that load the actual data. For each callable, it extracts a slice of audio_slice_duration_ms containing the maximum of the recording. It doesn’t perform the operation straight away; instead it creates a new callable, so the node stays lazy and the processing is done only once, exactly when it’s time to save the data.

Parameters:
  • audio_clips – a dictionary of callables returning audio, sample rate.

  • audio_slice_duration_ms – duration of the slice to extract, in milliseconds

Returns:

dictionary of callables returning sliced audio
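The lazy idiom described above can be sketched as follows. This is an illustration under assumptions, not the package's code: it assumes mono audio as a 1-D NumPy array, takes "the max" to mean the sample with the largest absolute amplitude, centres the window on that sample, and has the new callables return (audio, sample rate) as well.

```python
import numpy as np


def extract_loudest_slice_sketch(audio_clips, audio_slice_duration_ms):
    """Wrap each loader in a new callable that slices around the peak sample."""

    def make_slicer(load):
        def slicer():
            audio, sample_rate = load()  # the actual load happens only here
            n = int(sample_rate * audio_slice_duration_ms / 1000)
            peak = int(np.argmax(np.abs(audio)))
            # Centre the window on the peak, clamped to the recording bounds.
            start = max(0, min(peak - n // 2, len(audio) - n))
            return audio[start:start + n], sample_rate

        return slicer

    return {name: make_slicer(load) for name, load in audio_clips.items()}


calls = []  # records how many times the fake loader actually runs


def fake_loader():
    calls.append(1)
    audio = np.zeros(1000)
    audio[700] = 1.0  # a single loud sample
    return audio, 1000  # 1000 Hz sample rate


lazy = extract_loudest_slice_sketch({"a.wav": fake_loader}, 200)
loads_before = len(calls)  # still 0: building the dict loads nothing
sliced, sample_rate = lazy["a.wav"]()  # loading and slicing happen now
```

The inner make_slicer function exists so that each new callable captures its own loader; building the closures directly inside the dict comprehension would also work, but the helper keeps the capture explicit.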

birdnet_tiny_forge.pipelines.data_preprocessing.nodes.extract_metadata(audio_slices)

Return a pandas dataframe of metadata for the audio slices. For each slice this includes its original path and a label inferred from that path.

Parameters:

audio_slices – a dictionary of callables returning sliced audio

Returns:

pandas dataframe of metadata for the audio slices
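As an illustration of inferring a label from a path, the sketch below assumes that the dictionary keys are the original paths and that the label is the name of the parent folder. Both are assumptions; the source does not say which part of the path carries the label, and the column names "path" and "label" are hypothetical.

```python
from pathlib import PurePosixPath

import pandas as pd


def extract_metadata_sketch(audio_slices):
    """Build one metadata row per slice without loading any audio."""
    rows = [
        # Assumed layout: <label>/<file name>, so the parent folder is the label.
        {"path": key, "label": PurePosixPath(key).parent.name}
        for key in audio_slices
    ]
    return pd.DataFrame(rows, columns=["path", "label"])


slices = {"robin/rec_001.wav": lambda: None, "wren/rec_002.wav": lambda: None}
slices_metadata = extract_metadata_sketch(slices)
```

Only the dictionary keys are read, so none of the callables is invoked and no audio is loaded.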

birdnet_tiny_forge.pipelines.data_preprocessing.nodes.plot_slices_sample(audio_slices, n_slices)

Plot a few slices of data as a plotly figure

Parameters:
  • audio_slices – a dictionary of callables returning sliced audio

  • n_slices – number of slices to plot

Returns:

plotly figure

birdnet_tiny_forge.pipelines.data_preprocessing.nodes.plot_splits_info(audio_slices_metadata: DataFrame)

Plot split counts broken down by clip class, into a plotly figure

Parameters:

audio_slices_metadata – metadata for the audio slices dataset

Returns:

plotly figure

birdnet_tiny_forge.pipelines.data_preprocessing.nodes.save_labels_dict(audio_slices_metadata: DataFrame)

Create dictionary mapping audio labels to sequential integers
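A minimal sketch of such a mapping is shown below. The column name "label", the alphabetical ordering, and the numbering from 0 are assumptions made for illustration; the source only states that labels map to sequential integers.

```python
import pandas as pd


def labels_dict_sketch(audio_slices_metadata):
    """Map each distinct label to an integer index (illustrative only)."""
    # Sorting makes the mapping independent of row order in the metadata.
    labels = sorted(audio_slices_metadata["label"].unique())
    return {label: index for index, label in enumerate(labels)}


metadata = pd.DataFrame({"label": ["robin", "wren", "robin", "blackbird"]})
labels_dict = labels_dict_sketch(metadata)
# labels_dict == {"blackbird": 0, "robin": 1, "wren": 2}
```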

birdnet_tiny_forge.pipelines.data_preprocessing.nodes.slices_filter_short(audio_slices, audio_slice_duration_ms)

Filter out slices that are shorter than audio_slice_duration_ms, as well as files that can’t be opened.

Parameters:
  • audio_slices – a dictionary of callables returning sliced audio

  • audio_slice_duration_ms – minimum slice duration to keep, in milliseconds
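The filtering step can be sketched as follows. This is an illustration, not the package's code: it assumes each callable returns (audio, sample rate), treats any exception raised by a loader as "can't be opened", and calls every loader once to measure its length.

```python
def slices_filter_short_sketch(audio_slices, audio_slice_duration_ms):
    """Keep only the loaders whose audio opens and is long enough."""
    kept = {}
    for name, load in audio_slices.items():
        try:
            audio, sample_rate = load()
        except Exception:
            continue  # unreadable file: drop it
        duration_ms = 1000 * len(audio) / sample_rate
        if duration_ms >= audio_slice_duration_ms:
            kept[name] = load  # keep the original lazy loader
    return kept


def broken_loader():
    raise OSError("cannot open file")


candidates = {
    "ok.wav": lambda: ([0.0] * 3000, 1000),    # 3000 ms at 1000 Hz
    "short.wav": lambda: ([0.0] * 100, 1000),  # 100 ms, too short
    "broken.wav": broken_loader,
}
kept = slices_filter_short_sketch(candidates, 3000)
```

Unlike the lazy nodes, a filter has to inspect each clip up front in order to decide which keys survive; the surviving entries are still the original callables, so downstream nodes stay lazy.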

birdnet_tiny_forge.pipelines.data_preprocessing.nodes.slices_make_canonical(audio_slices, sample_rate, subtype)

Make sure all slices have the same sample rate, number of channels, etc. This function uses the Kedro idiom of taking a dictionary of callables and returning a dictionary of callables to allow for lazy processing of data.

Parameters:
  • audio_slices – a dictionary of callables returning sliced audio

  • sample_rate

  • subtype – see soundfile’s documentation for details

Returns:

dictionary of callables returning canonicalized sliced audio
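A rough sketch of the canonicalization idea follows. It rests on several assumptions: audio arrives as a NumPy array shaped (frames,) or (frames, channels), the canonical form is mono, and resampling is done by plain linear interpolation as a stand-in for a proper resampler (it applies no anti-aliasing filter). The subtype parameter is left out because it concerns the on-disk sample format used when writing with soundfile.

```python
import numpy as np


def slices_make_canonical_sketch(audio_slices, sample_rate):
    """Wrap each loader so it yields mono audio at the target sample rate."""

    def make_canonical(load):
        def canonical():
            audio, source_rate = load()  # deferred until the data is needed
            audio = np.asarray(audio, dtype=np.float64)
            if audio.ndim == 2:
                audio = audio.mean(axis=1)  # (frames, channels) -> mono
            if source_rate != sample_rate:
                n_out = int(round(len(audio) * sample_rate / source_rate))
                old_t = np.arange(len(audio)) / source_rate
                new_t = np.arange(n_out) / sample_rate
                # Linear interpolation: a simple stand-in for real resampling.
                audio = np.interp(new_t, old_t, audio)
            return audio, sample_rate

        return canonical

    return {name: make_canonical(load) for name, load in audio_slices.items()}


stereo = {"a.wav": lambda: (np.zeros((44100, 2)), 44100)}  # 1 s of stereo
canonical = slices_make_canonical_sketch(stereo, 16000)
audio, rate = canonical["a.wav"]()
```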

birdnet_tiny_forge.pipelines.data_preprocessing.pipeline module

This pipeline pre-processes the audio data and decides early on which split (train, test, or validation) each audio clip belongs to.

birdnet_tiny_forge.pipelines.data_preprocessing.pipeline.create_pipeline(**kwargs) → Pipeline