birdnet_tiny_forge.pipelines.data_preprocessing package
Submodules
birdnet_tiny_forge.pipelines.data_preprocessing.nodes module
Nodes to be used in data preprocessing pipeline
- birdnet_tiny_forge.pipelines.data_preprocessing.nodes.decide_splits(df, test_size=0.2, val_size=0.1, random_state=42)
Given the dataset metadata, populate the split (train/validation/test) information.
- Parameters:
df – metadata dataframe
test_size – fraction of dataset used for testing
val_size – fraction of dataset used for validation
random_state – seed for the random number generator, so that splits are reproducible
- Returns:
metadata dataframe containing split info
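As a rough illustration of what populating split information could look like, the sketch below shuffles row indices with a seeded generator and assigns the first fractions to test and validation. It is not taken from the package source: a list of dicts stands in for the metadata dataframe, and the `split` key, the split names, and the plain (non-stratified) shuffle are all assumptions.

```python
import random


def decide_splits_sketch(rows, test_size=0.2, val_size=0.1, random_state=42):
    """Return copies of `rows` with a hypothetical 'split' key added."""
    rng = random.Random(random_state)  # seeded, so the assignment is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)

    n_test = round(len(rows) * test_size)
    n_val = round(len(rows) * val_size)

    out = [dict(row) for row in rows]  # do not mutate the caller's data
    for position, idx in enumerate(indices):
        if position < n_test:
            out[idx]["split"] = "test"
        elif position < n_test + n_val:
            out[idx]["split"] = "val"
        else:
            out[idx]["split"] = "train"
    return out


rows = [{"path": f"clip_{i}.wav", "label": "robin"} for i in range(10)]
splits = decide_splits_sketch(rows)
counts = {name: sum(r["split"] == name for r in splits) for name in ("train", "val", "test")}
```

With ten rows and the default fractions, `counts` holds seven train rows, one validation row, and two test rows; calling the function again with the same `random_state` yields the same assignment.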
- birdnet_tiny_forge.pipelines.data_preprocessing.nodes.extract_loudest_slice(audio_clips, audio_slice_duration_ms)
This node uses Kedro’s lazy loading idiom of passing a dictionary of callables which load the actual data. For each callable, it extracts a slice of audio_slice_duration_ms containing the loudest point (the maximum) of the recording. It doesn’t perform the operation straight away; instead it creates a new callable, so we stay lazy and the processing is done only once, exactly when it’s time to save the data.
- Parameters:
audio_clips – a dictionary of callables returning audio, sample rate.
audio_slice_duration_ms – duration of the slice to extract, in milliseconds
- Returns:
dictionary of callables returning sliced audio
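The lazy idiom described above can be sketched as follows. This is a minimal illustration, not the package's implementation: audio is a plain list of samples rather than an array, and centring the window on the largest absolute sample is an assumption about how the loudest slice is chosen. The names `_loudest_slice`, `extract_loudest_slice_sketch`, and `fake_loader` are hypothetical.

```python
from functools import partial


def _loudest_slice(load, duration_ms):
    """Load one clip and cut a window centred on its largest absolute sample."""
    audio, sample_rate = load()  # the actual load happens only here
    n = int(sample_rate * duration_ms / 1000)
    if len(audio) <= n:
        return audio, sample_rate  # clip is no longer than the window: return it whole
    peak = max(range(len(audio)), key=lambda i: abs(audio[i]))
    start = min(max(peak - n // 2, 0), len(audio) - n)  # keep the window in bounds
    return audio[start:start + n], sample_rate


def extract_loudest_slice_sketch(audio_clips, audio_slice_duration_ms):
    # Wrap each loader in a new callable; nothing is loaded or sliced yet.
    return {
        name: partial(_loudest_slice, load, audio_slice_duration_ms)
        for name, load in audio_clips.items()
    }


calls = []


def fake_loader():
    calls.append("loaded")  # record when the data is really read
    return [0.0, 0.1, 0.0, -0.9, 0.2, 0.0, 0.0, 0.0], 1000  # 8 samples at 1 kHz


lazy = extract_loudest_slice_sketch({"robin/a.wav": fake_loader}, 4)
```

At this point `calls` is still empty. Only invoking `lazy["robin/a.wav"]()` triggers the load and returns the four-sample window around the peak at index 3.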
- birdnet_tiny_forge.pipelines.data_preprocessing.nodes.extract_metadata(audio_slices)
Return a pandas dataframe holding metadata for each audio slice, including its original path and a label inferred from that path.
- Parameters:
audio_slices – a dictionary of callables returning sliced audio
- Returns:
pandas dataframe of audio slice metadata
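One plausible way to infer a label from a path is to take the name of the file's parent directory. The sketch below assumes that layout (for example `robin/a.wav`), which the source does not confirm, and returns a list of records instead of a dataframe to stay dependency-free. The column names `path` and `label` are hypothetical.

```python
from pathlib import PurePosixPath


def extract_metadata_sketch(audio_slices):
    """Build one metadata record per slice without calling any loader."""
    records = []
    for path in sorted(audio_slices):
        # Assumption: the label is the name of the file's parent directory.
        label = PurePosixPath(path).parent.name
        records.append({"path": path, "label": label})
    return records


slices = {"wren/b.wav": lambda: None, "robin/a.wav": lambda: None}
metadata = extract_metadata_sketch(slices)
```

Passing `metadata` to `pandas.DataFrame` would turn the records into a dataframe with one row per slice.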
- birdnet_tiny_forge.pipelines.data_preprocessing.nodes.plot_slices_sample(audio_slices, n_slices)
Plot a few slices of data as a plotly figure
- Parameters:
audio_slices – a dictionary of callables returning sliced audio
n_slices – number of slices to plot
- Returns:
plotly figure
- birdnet_tiny_forge.pipelines.data_preprocessing.nodes.plot_splits_info(audio_slices_metadata: DataFrame)
Plot split counts, broken down by clip class, as a plotly figure
- Parameters:
audio_slices_metadata – metadata for the audio slices dataset
- Returns:
plotly figure
- birdnet_tiny_forge.pipelines.data_preprocessing.nodes.save_labels_dict(audio_slices_metadata: DataFrame)
Create dictionary mapping audio labels to sequential integers
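A mapping from labels to sequential integers can be sketched in one line. The version below takes a plain list of labels rather than the metadata dataframe, and it assigns integers in sorted label order; both choices are assumptions, since the source does not state the ordering.

```python
def labels_dict_sketch(labels):
    """Map each distinct label to an integer, in sorted label order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


mapping = labels_dict_sketch(["wren", "robin", "wren", "blackbird"])
```

Sorting before enumerating keeps the mapping stable across runs, regardless of the order in which labels appear in the metadata.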
- birdnet_tiny_forge.pipelines.data_preprocessing.nodes.slices_filter_short(audio_slices, audio_slice_duration_ms)
Filter out slices that are shorter than audio_slice_duration_ms, as well as files that can’t be opened.
- Parameters:
audio_slices – a dictionary of callables returning sliced audio
audio_slice_duration_ms – minimum slice duration, in milliseconds
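The filtering described above might look like the following sketch. It assumes each callable returns `(audio, sample_rate)` and that the check has to load each slice eagerly to measure its length; whether the package does this eagerly is not stated in the source. `broken_loader` is a hypothetical stand-in for a file that can't be opened.

```python
def slices_filter_short_sketch(audio_slices, audio_slice_duration_ms):
    """Keep only loaders whose audio is at least the requested duration."""
    kept = {}
    for name, load in audio_slices.items():
        try:
            audio, sample_rate = load()
        except Exception:
            continue  # unreadable file: drop it
        duration_ms = 1000 * len(audio) / sample_rate
        if duration_ms >= audio_slice_duration_ms:
            kept[name] = load  # keep the original callable, not the loaded data
    return kept


def broken_loader():
    raise OSError("cannot open file")


slices = {
    "robin/long.wav": lambda: ([0.0] * 3000, 1000),  # 3000 ms of audio
    "robin/short.wav": lambda: ([0.0] * 500, 1000),  # 500 ms of audio
    "robin/broken.wav": broken_loader,
}
kept = slices_filter_short_sketch(slices, 3000)
```

Only `robin/long.wav` survives: the short slice fails the duration check and the broken loader raises before its length can be measured.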
- birdnet_tiny_forge.pipelines.data_preprocessing.nodes.slices_make_canonical(audio_slices, sample_rate, subtype)
Make sure all slices have the same sample rate, number of channels, etc. This function uses the Kedro idiom of taking a dictionary of callables and returning a dictionary of callables to allow for lazy processing of data.
- Parameters:
audio_slices – a dictionary of callables returning sliced audio
sample_rate – target sample rate shared by all slices
subtype – see soundfile’s documentation for details
- Returns:
dictionary of callables returning canonicalized sliced audio
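To make the idea of canonicalizing concrete, here is a dependency-free sketch that downmixes to mono and resamples lazily. It rests on several assumptions: audio is a list of per-frame channel tuples, the canonical channel count is one, and linear interpolation stands in for a proper resampler (it does no anti-alias filtering). The `subtype` argument is left out because it concerns how the file is written, which this sketch does not do.

```python
from functools import partial


def _canonical(load, sample_rate):
    """Downmix to mono, then resample by linear interpolation."""
    frames, source_rate = load()  # the actual load happens only here
    mono = [sum(frame) / len(frame) for frame in frames]  # average the channels
    if source_rate == sample_rate:
        return mono, sample_rate
    n_out = int(len(mono) * sample_rate / source_rate)
    resampled = []
    for i in range(n_out):
        position = i * source_rate / sample_rate  # fractional index in the source
        left = min(int(position), len(mono) - 1)
        right = min(left + 1, len(mono) - 1)
        weight = position - left
        resampled.append(mono[left] * (1 - weight) + mono[right] * weight)
    return resampled, sample_rate


def slices_make_canonical_sketch(audio_slices, sample_rate):
    # Return new callables; no audio is touched until one of them is called.
    return {
        name: partial(_canonical, load, sample_rate)
        for name, load in audio_slices.items()
    }


stereo = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]  # 4 stereo frames at 2 Hz
lazy = slices_make_canonical_sketch({"robin/a.wav": lambda: (stereo, 2)}, 4)
audio, rate = lazy["robin/a.wav"]()
```

Doubling the rate of the four-frame ramp yields eight mono samples, with the final sample held at the last source value because there is nothing left to interpolate towards.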
birdnet_tiny_forge.pipelines.data_preprocessing.pipeline module
This pipeline pre-processes audio data and decides early on which audio data belongs to which train/test/validation split.
- birdnet_tiny_forge.pipelines.data_preprocessing.pipeline.create_pipeline(**kwargs) → Pipeline