Case Studies


" Utilising AI/ML for Enhanced Wildlife Conservation Analytics "


Wildlife conservation is essential for maintaining ecosystem balance and, ultimately, for supporting human survival. Despite measures such as wildlife sanctuaries and legal protections, poaching and similar activities continue to threaten endangered species. In India, organizations such as the Wildlife Protection Society of India have documented cases of tiger and leopard poaching. Industrialization and deforestation endanger wildlife further, and some projections anticipate a significant disparity between human and animal populations by 2050. Modern technology, specifically machine learning, is proposed here as a tool for monitoring and protecting wildlife more effectively. Systems that combine sound sensors with machine learning algorithms aim to distinguish animal from human activity in wildlife habitats, providing real-time monitoring and data analysis for conservation purposes.

Research Summary

  1. The proposed research endeavour aims to train a machine learning model using audio data primarily consisting of elephant vocalizations.
  2. Elephant rumbles serve as a rich source of communication within their species, providing unique insights into their behavioural patterns and interactions within their habitat.
  3. By harnessing this data, the research seeks to provide real-time updates on the ecological dynamics of the jungle environment.
  4. This approach aligns with contemporary advancements in technological methodologies, particularly in the realm of machine learning.
  5. The research underscores a commitment to enhancing wildlife conservation efforts through innovative and scientifically rigorous means.


    1. Infrasonic Calls:

    • Elephants produce low-frequency rumbles whose fundamental frequencies lie in or near the infrasonic range (below about 20 Hz); these calls are crucial for long-distance communication in both Asian and African elephants.
    • Asian elephants: Frequency ranges from 14 to 24 Hz, sound pressure levels of 85–90 dB, lasting 10–15 seconds.
    • African elephants: Calls range from 15 to 35 Hz, sound pressure levels up to 117 dB, facilitating communication over many kilometres, with a potential range of approximately 10 km (6 mi).

    2. Identified Infrasonic Calls at Amboseli National Park:

    • A. Greeting Rumble: Emitted by adult females reuniting after being separated for several hours.
    • B. Contact Call: Soft, unmodulated sounds from an individual separated from the group, audible within about 2 km (1.2 mi).
    • C. Contact Answer: Response to a contact call, initially loud and softening toward the end.
    • D. "Let's Go" Rumble: Soft rumble from the matriarch signalling herd members to move.
    • E. Musth Rumble: Low-frequency, pulsated rumble from musth males, known as the "motorcycle."
    • F. Female Chorus: Low-frequency, modulated chorus from several cows in response to a musth rumble.
    • G. Postcopulatory Call: Made by an oestrous cow after mating.
    • H. Mating Pandemonium: Excitement calls from a cow's family after mating.

    Speech-to-text (STT) technology, also known as automatic speech recognition (ASR), converts spoken language into written text. Machine learning plays a crucial role in this process. Here's an overview of how speech-to-text works and how machine learning is used:

    • Audio Input:  Capture audio from sources like microphones or recordings.
    • Preprocessing:  Enhance audio quality by filtering noise and normalizing volume.
    • Feature Extraction:   Extract relevant features like MFCC (Mel-Frequency Cepstral Coefficients) to represent sound.
    • Acoustic Modeling:   Use machine learning (CNNs, RNNs) to map acoustic features to phonemes or sub-word units.
    • Language Modeling:   Predict word sequences using N-gram models, RNNs, or transformers trained on text corpora.
    • Decoding:   Combine acoustic and language model scores to find the most likely word sequence, for example with Viterbi or beam search in HMM-based systems, or directly with sequence-to-sequence models.
    • Post-processing:   Improve accuracy through spell-checking and punctuation correction.
    • Output:   Provide recognized text for various applications.
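
    The first three stages above (audio input, preprocessing, feature extraction) can be sketched with NumPy alone. This is only an illustration under stated assumptions: the function names, the 400-sample window, and the 160-sample hop are hypothetical choices, a synthetic tone stands in for captured audio, and a log power spectrum stands in for MFCCs.

```python
import numpy as np

def peak_normalize(signal):
    """Scale the signal so its largest absolute sample is 1 (volume normalization)."""
    peak = np.max(np.abs(signal))
    return signal if peak == 0 else signal / peak

def frame_signal(signal, frame_length, hop_length):
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    return np.stack([signal[s:s + frame_length] for s in starts])

def log_power_features(signal, frame_length=400, hop_length=160):
    """Per-frame log power spectrum: a simple stand-in for MFCC-style features."""
    frames = frame_signal(peak_normalize(signal), frame_length, hop_length)
    windowed = frames * np.hanning(frame_length)
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2
    return np.log(power + 1e-10)

# Synthetic stand-in for captured audio: one second of a 440 Hz tone at 16 kHz
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = 0.3 * np.sin(2 * np.pi * 440 * t)

features = log_power_features(audio)
print(features.shape)  # (number of frames, number of frequency bins)
```

    Each row of `features` describes one short slice of audio; an acoustic model would consume these rows in sequence.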

    Points to note:

    1. Machine learning models are trained on large datasets of audio recordings paired with transcriptions.
    2. Training optimizes model parameters using gradient-based algorithms such as stochastic gradient descent (SGD).
    3. Performance relies on data quality, model architecture, and algorithm effectiveness.
    4. Advancements in data and ML techniques drive improved accuracy and robustness.
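
    To make the idea of gradient-based training concrete, here is a minimal sketch of SGD fitting a logistic-regression classifier. Everything in it is an assumption made for illustration: the data are synthetic two-class feature vectors, and the learning rate and number of epochs are arbitrary. A real speech or rumble model would be far larger, but the update rule has the same shape.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class "feature vectors" standing in for real acoustic features
n_per_class, n_features = 200, 4
X = np.vstack([rng.normal(-1.0, 1.0, (n_per_class, n_features)),
               rng.normal(+1.0, 1.0, (n_per_class, n_features))])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

weights = np.zeros(n_features)
bias = 0.0
learning_rate = 0.1

for epoch in range(20):
    for i in rng.permutation(len(X)):       # visit the samples in random order
        prediction = sigmoid(X[i] @ weights + bias)
        error = prediction - y[i]           # gradient of the log loss w.r.t. the logit
        weights -= learning_rate * error * X[i]
        bias -= learning_rate * error

accuracy = np.mean((sigmoid(X @ weights + bias) > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```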

    A modified speech-to-text approach is being used to decode the rumbles made by elephants.

    Many of the techniques used in speech-to-text (STT) can be adapted for other audio recognition tasks, including analyzing low-frequency elephant rumbles. However, there are some important considerations and adaptations needed:

    • Feature Extraction:   Traditional MFCCs may struggle with low-frequency sounds such as elephant rumbles, because speech front ends typically use short analysis windows (around 25 ms) that resolve frequencies of a few tens of hertz poorly. In addition to traditional MFCCs, we will explore long-term spectral features and modulation spectra, which may capture slowly varying, low-frequency structure better.
    • Acoustic Modelling:   Machine learning models, particularly deep learning models such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), can still be used for acoustic modelling. However, the architecture and training procedure may need to be adjusted to accommodate the characteristics of low-frequency audio.
    • Training Data:   Collecting annotated data for low-frequency elephant rumbles can be challenging, but if you have access to an expert who can annotate the calls, you can use the annotated data to train your machine learning models. The dataset should represent the variability in elephant rumbles, including different individuals, contexts, and environmental conditions.
    • Model Evaluation:   Evaluate your models using metrics appropriate for audio recognition tasks. Depending on your objectives, accuracy, precision, recall, or F1-score may be relevant.
    • Domain Knowledge:   Incorporating domain knowledge about elephant behaviour, communication patterns, and environmental factors can improve model performance. This knowledge can guide the selection of features, the design of the model architecture, and the interpretation of the results.
    • Real-World Deployment:   Consider the practical constraints of deploying your system in the field, such as computational efficiency, robustness to environmental noise, and the ability to operate in real time or on resource-constrained devices.
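
    The evaluation metrics named above follow directly from counts of true and false positives and negatives. The sketch below assumes a binary task (1 = "rumble present", 0 = "no rumble"); the function name and the example labels are hypothetical and serve only to show the arithmetic.

```python
def binary_classification_metrics(true_labels, predicted_labels):
    """Return accuracy, precision, recall and F1 for binary labels (0 or 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels for eight audio clips
true_labels      = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_labels = [1, 0, 1, 0, 1, 0, 0, 1]

metrics = binary_classification_metrics(true_labels, predicted_labels)
print(metrics)
```

    With one missed rumble and one false alarm among eight clips, all four metrics come out at 0.75. On imbalanced field data, where rumbles are rare, accuracy alone can look high even when recall is poor, which is why precision, recall, and F1 are usually reported alongside it.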

Traditional MFCC, Long-term Spectral Features And Modulation Spectra

import librosa
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import librosa.display
from IPython.display import Audio
import pandas as pd
import os
from sklearn.model_selection import train_test_split
import skimage.io

Load sample audio file and display the raw waveform (time domain)

y, sr = librosa.load('/content/begging1.mp3', sr=32000)
# y, sr = librosa.load('/content/Elephant calls for companions.wav', sr=32000)
librosa.display.waveshow(y, sr= sr, x_axis='s')
print("The sampled audio is returned as a numpy array (time series) and has ", y.shape, " number of samples")
print("The 10 randomly picked consecutive samples of the audio are: ", y[3000:3010])


The sampled audio is returned as a numpy array (time series) and has  (164676,)  number of samples
The 10 randomly picked consecutive samples of the audio are:  [-0.00036051  0.0018779  -0.00192935  0.00093846 -0.00352445 -0.01220394 -0.01800036 -0.02024757 -0.0188582  -0.01652359]


To grasp the concept of a spectrogram, it's essential to comprehend what a spectrum entails. The spectrum refers to the collection of frequencies present in a specific signal, with the fundamental frequency being the lowest. Harmonics, which are frequencies that are integer multiples of the fundamental frequency, are also part of the spectrum. As signals, especially non-periodic ones, exhibit changes in their spectrum over time, a common approach is to analyze small fixed sections of the signal sequentially. This process, known as Short Time Fourier Transform (STFT), involves dividing the sampled signal into equal segments and performing Fourier Transform on each segment individually. These spectra are then stacked together to form the spectrogram, represented as a matrix.
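
The segment-by-segment procedure described above can be sketched in a few lines of NumPy. This is an illustration rather than the librosa implementation: `simple_stft` is a hypothetical helper, it does no padding or centring, and the test signal (50 Hz for one second, then 200 Hz) is synthetic.

```python
import numpy as np

def simple_stft(signal, window_length, hop_length):
    """Split the signal into overlapping windows and Fourier-transform each one."""
    window = np.hanning(window_length)
    n_frames = 1 + (len(signal) - window_length) // hop_length
    spectra = [
        np.fft.rfft(window * signal[i * hop_length: i * hop_length + window_length])
        for i in range(n_frames)
    ]
    # Stack the per-window spectra as columns: rows are frequencies, columns are time steps
    return np.stack(spectra, axis=1)

# Synthetic signal whose spectrum changes over time
sample_rate = 1000
t = np.arange(2 * sample_rate) / sample_rate
signal = np.where(t < 1.0, np.sin(2 * np.pi * 50 * t), np.sin(2 * np.pi * 200 * t))

spectrogram = np.abs(simple_stft(signal, window_length=100, hop_length=50)) ** 2
frequencies = np.fft.rfftfreq(100, d=1 / sample_rate)

print(spectrogram.shape)                           # (51, 39)
print(frequencies[spectrogram[:, 0].argmax()])     # dominant frequency in the first window
print(frequencies[spectrogram[:, -1].argmax()])    # dominant frequency in the last window
```

The first column peaks at the 50 Hz bin and the last column at the 200 Hz bin, which is exactly the change over time that a single Fourier transform of the whole signal would hide.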

The Fourier Transform converts a time-domain signal into its spectrum. Because the STFT applies it to one segment at a time, the window size (the number of samples considered at a time) needs to be specified, and this choice sets the trade-off between time resolution and frequency resolution.
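
The window size matters a great deal for elephant rumbles, because the spacing between FFT bins is the sampling rate divided by the window size. The sketch below uses the 32000 Hz sampling rate from the loading step and the 14 to 35 Hz range quoted earlier for rumbles; the helper names and the particular window sizes compared are illustrative assumptions.

```python
def stft_resolution(sample_rate, n_fft, hop_length):
    """Return (bin spacing in Hz, window duration in s, step between frames in s)."""
    bin_spacing_hz = sample_rate / n_fft
    window_duration_s = n_fft / sample_rate
    frame_step_s = hop_length / sample_rate
    return bin_spacing_hz, window_duration_s, frame_step_s

def bins_in_band(sample_rate, n_fft, low_hz, high_hz):
    """Count FFT bins whose centre frequency lies inside [low_hz, high_hz]."""
    spacing = sample_rate / n_fft
    return sum(1 for k in range(n_fft // 2 + 1) if low_hz <= k * spacing <= high_hz)

for n_fft in (1024, 2048, 8192):
    spacing, duration, step = stft_resolution(32000, n_fft, 320)
    count = bins_in_band(32000, n_fft, 14, 35)
    print(f"n_fft={n_fft}: {spacing:.2f} Hz per bin, "
          f"{duration * 1000:.0f} ms window, {count} bin(s) between 14 and 35 Hz")
```

At 32000 Hz, a 1024-sample window places a single bin in the 14 to 35 Hz band, a 2048-sample window places two, and an 8192-sample window places five, at the cost of a 256 ms window. Longer windows, or resampling to a lower rate before the STFT, are therefore worth considering for infrasonic calls.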

For further background on spectra, windowing, and the Short Time Fourier Transform (STFT), see:
• https://www.phon.ucl.ac.uk/courses/spsci/acoustics/week1-10.pdf
• https://download.ni.com/evaluation/pxi/Understanding%20FFTs%20and%20Windowing.pdf
• https://towardsdatascience.com/audio-deep-learning-made-simple-part-1-state-of-the-art-techniques-da1d3dff2504

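# Size of the Fast Fourier Transform (FFT), which will also be used as the window length
n_fft = 2048  # ASSUMED value: the original assignment is missing; 2048 matches the function defaults used later in this document
# Step or stride between windows. If the step is smaller than the window length, the windows will overlap
hop_length = 320  # ASSUMED value: the original assignment is missing; 320 samples is 10 ms at sr=32000
# Specify the window type for FFT/STFT
window_type = 'hann'
# Calculate the spectrogram as the square of the complex magnitude of the STFT
spectrogram_librosa = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=n_fft, window=window_type)) ** 2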

Transform the spectrogram output to a logarithmic scale by transforming the amplitude to decibels and frequency to a mel scale

Mel Spectrogram

The Mel spectrogram utilizes a non-linear transformation of the frequency scale known as the mel scale, which is based on human perception of pitch. This scale ensures that two pairs of frequencies separated by a constant delta in the mel scale are perceived as equally distant by humans.
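
One widely used formula for the mel scale is the HTK-style mapping shown below. It is given here as an illustration only; several variants of the mel scale exist, and librosa uses the Slaney variant by default (the HTK one is selected with `htk=True`), so the numbers will differ slightly from librosa's defaults.

```python
import math

def hz_to_mel(frequency_hz):
    """HTK-style conversion from hertz to mels."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps on the mel scale correspond to ever wider steps in hertz
for mel in range(0, 2001, 500):
    print(f"{mel:5d} mel  ->  {mel_to_hz(mel):8.1f} Hz")
```

Because the scale is nearly linear at low frequencies and compresses high frequencies, it allocates resolution the way human hearing does, which is not necessarily the best allocation for calls concentrated below 35 Hz.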

In machine learning applications involving speech and audio analysis, it is common to represent the power spectrogram using the mel scale. This is achieved by employing a bank of overlapping triangular filters, known as the mel filter bank, which calculates the energy of the spectrum within each frequency band.

The mel filter bank is a matrix with dimensions [number of mel bands x (n_fft/2 + 1)], where n_fft/2 + 1 is the number of frequency bins produced by an FFT of size n_fft. Applying it to a power spectrogram with dimensions [(n_fft/2 + 1) x number of frames] yields a mel spectrogram with dimensions [number of mel bands x number of frames].
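
The relationship between the filter bank, the power spectrogram, and the mel spectrogram can be checked with a small NumPy sketch. The filter bank built here is a simplified, unnormalized triangular bank on the HTK-style mel scale, so it is an assumption for illustration and will not match `librosa.filters.mel` value for value; the random matrix merely stands in for a real power spectrogram.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_mel_filter_bank(sample_rate, n_fft, n_mels):
    """Build overlapping triangular filters, one row per mel band."""
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    # n_mels + 2 band edges, equally spaced on the mel scale
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, centre, right = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

sample_rate, n_fft, n_mels, n_frames = 32000, 2048, 64, 100
bank = triangular_mel_filter_bank(sample_rate, n_fft, n_mels)

# Random non-negative values standing in for a real power spectrogram
power_spectrogram = np.random.default_rng(0).random((n_fft // 2 + 1, n_frames))
mel_spectrogram = bank @ power_spectrogram

print(bank.shape)             # (64, 1025)
print(mel_spectrogram.shape)  # (64, 100)
```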

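mel_bins = 64  # Number of mel bands
fmin = 0       # Lowest frequency (Hz) covered by the filter bank
fmax = None    # Highest frequency (Hz); None means sr / 2
Mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, win_length=n_fft, window=window_type, n_mels=mel_bins, fmin=fmin, fmax=fmax, power=2.0)
print("The shape of mel spectrogram is: ", Mel_spectrogram.shape)

# Plot the mel spectrogram. The values are linear power, not decibels, so the colorbar is labelled accordingly.
librosa.display.specshow(Mel_spectrogram, sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(label='Power')
plt.title('Mel spectrogram')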


Log Mel Spectrogram

Convert the power mel spectrogram to a logarithmic amplitude scale (decibels). In doing so, we also normalize the spectrogram so that its maximum corresponds to the 0 dB point.
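
The conversion itself is a one-line formula: 10 times the base-10 logarithm of the power divided by a reference, here the maximum of the spectrogram. The sketch below is a NumPy illustration with a hypothetical function name and a tiny made-up matrix; in librosa the equivalent call is `librosa.power_to_db(S, ref=np.max)`, which by default additionally clips values more than 80 dB below the peak.

```python
import numpy as np

def power_to_db_relative_to_max(power, floor=1e-10):
    """Convert power values to decibels, using the largest value as the 0 dB reference."""
    power = np.maximum(power, floor)   # avoid taking the logarithm of zero
    return 10.0 * np.log10(power / power.max())

# Tiny made-up "spectrogram": each factor of 10 in power is a 10 dB step
power = np.array([[1.0, 0.1],
                  [0.01, 0.0]])
db = power_to_db_relative_to_max(power)
print(db)
```

The four entries come out at approximately 0, -10, -20 and -100 dB; the last one is set by the floor rather than by the data.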

https://stackoverflow.com/questions/52432731/store-the-spectrogram-as-image-in-python/52683474 - How to save the figure in the working directory.

https://stackoverflow.com/questions/56719138/how-can-i-save-a-librosa-spectrogram-plot-as-a-specific-sized-image/57204349#57204349 - if the desire is to save the data in the spectrogram (not the image itself)

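import librosa
import matplotlib.pyplot as plt
import numpy as np

# Convert power to decibels, using the spectrogram maximum as the 0 dB reference
mel_spectrogram_db = librosa.power_to_db(Mel_spectrogram, ref=np.max)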

# Calculate the aspect ratio for horizontal and vertical stretching
aspect_ratio_horizontal = mel_spectrogram_db.shape[1] / mel_spectrogram_db.shape[0]
aspect_ratio_vertical = mel_spectrogram_db.shape[0] / mel_spectrogram_db.shape[1]

# Limit the maximum size of the figure
max_fig_width = 20  # Maximum width in inches
max_fig_height = 8  # Maximum height in inches
max_aspect_ratio = max_fig_width / max_fig_height

# Adjust the aspect ratio if it exceeds the maximum
if aspect_ratio_horizontal > max_aspect_ratio:
    aspect_ratio_horizontal = max_aspect_ratio
if aspect_ratio_vertical > max_aspect_ratio:
    aspect_ratio_vertical = max_aspect_ratio

# Create a new figure with adjusted aspect ratio
fig, ax = plt.subplots(figsize=(max_fig_width, max_fig_height))

# Plot the Mel spectrogram
img = ax.imshow(mel_spectrogram_db, origin='lower', aspect='auto', cmap='viridis', extent=[0, mel_spectrogram_db.shape[1]*hop_length/sr, 0, mel_spectrogram_db.shape[0]])

# Create colorbar using the image object
plt.colorbar(img, ax=ax, format='%+2.0f dB')
plt.title('Log Mel spectrogram')
plt.xlabel('Time (s)')
plt.ylabel('Mel Frequency')

Visualize the mel filter bank

mel_filter_bank = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=mel_bins, fmin=0.0, fmax=None, htk=False, norm='slaney')
print("The shape of the mel filter bank is: ", mel_filter_bank.shape)
librosa.display.specshow(mel_filter_bank, sr=sr, x_axis='linear')
plt.colorbar(label='Filter weight')  # filter weights are linear values, not decibels
plt.title('Mel filter bank')


import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def extract_long_term_spectral_features(audio_path, n_fft=2048, hop_length=320):
    """
    Extracts long-term spectral features from an audio file.

    Parameters:
    - audio_path (str): Path to the audio file.
    - n_fft (int): Number of samples used for each Fourier Transform.
    - hop_length (int): Hop length (in samples) for the STFT. Controls the time resolution of the spectrogram.

    Returns:
    - long_term_spectral_features (ndarray): Magnitude spectrogram with shape (frames, n_fft // 2 + 1).
    """
    # Load audio file (librosa resamples to 22050 Hz unless sr is given)
    y, sr = librosa.load(audio_path)

    # Compute short-time Fourier transform (STFT)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)

    # Compute magnitude spectrogram
    magnitude_spectrogram = np.abs(stft)

    # Transpose so that rows are time frames and columns are frequency bins
    magnitude_spectrogram = np.transpose(magnitude_spectrogram)

    return magnitude_spectrogram

# Usage:
audio_path = "/content/begging1.mp3"
long_term_spectral_features = extract_long_term_spectral_features(audio_path)
print("Long-term spectral features shape:", long_term_spectral_features.shape)

# Normalize the spectrogram
normalized_spectrogram = librosa.util.normalize(long_term_spectral_features)
print("normalized_spectrogram features shape:", normalized_spectrogram.shape)

# Plot spectrogram. specshow expects (frequency, time), so transpose back, and pass the
# sampling rate (22050, the librosa.load default) and hop length (320) used inside the function.
plt.figure(figsize=(10, 5))
librosa.display.specshow(normalized_spectrogram.T, sr=22050, hop_length=320, x_axis='time', y_axis='linear')
plt.colorbar(label='Normalized magnitude')
plt.title('Long-term spectral features')
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
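
The function above returns one spectrum per short frame. One common way to turn such frames into a longer-term descriptor is to average the spectra over blocks of consecutive frames. The sketch below shows that step on a tiny made-up matrix; the function name and block size are hypothetical, and this is one possible reading of "long-term spectral features", not necessarily the one intended by the authors.

```python
import numpy as np

def long_term_average(spectrogram, frames_per_block):
    """Average a (frames x bins) magnitude spectrogram over consecutive blocks of frames."""
    n_blocks = spectrogram.shape[0] // frames_per_block
    trimmed = spectrogram[:n_blocks * frames_per_block]   # drop an incomplete final block
    return trimmed.reshape(n_blocks, frames_per_block, -1).mean(axis=1)

# Six frames with two frequency bins each, averaged in blocks of three frames
frames = np.arange(12, dtype=float).reshape(6, 2)
averaged = long_term_average(frames, frames_per_block=3)
print(averaged)
```

With a 320-sample hop at 22050 Hz, averaging about 70 frames would summarize roughly one second of audio, which is closer to the duration of a rumble than a single frame is.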


import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def extract_modulation_spectra(audio_path, n_fft=2048, hop_length=512):
    """
    Extracts modulation spectra from an audio file.

    Parameters:
    - audio_path (str): Path to the audio file.
    - n_fft (int): Number of samples used for each Fourier Transform.
    - hop_length (int): Hop length (in samples) for the STFT. Controls the time resolution of the spectrogram.

    Returns:
    - modulation_spectra (ndarray): Extracted modulation spectra.
    """
    # Load audio file (librosa resamples to 22050 Hz unless sr is given)
    y, sr = librosa.load(audio_path)

    # Compute short-time Fourier transform (STFT)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)

    # Compute modulation spectra.
    # melspectrogram expects a real-valued power spectrogram, so pass |STFT|^2 rather than
    # the complex STFT. Note: the result is a mel-scaled power spectrogram; it does not
    # include a transform along the time axis.
    modulation_spectra = librosa.feature.melspectrogram(S=np.abs(stft) ** 2, sr=sr)

    return modulation_spectra

# Example usage:
audio_path = "/content/begging1.mp3"
modulation_spectra = extract_modulation_spectra(audio_path)
print("modulation spectra shape:", modulation_spectra.shape)

# Normalize the modulation spectra differently for better contrast
normalized_modulation_spectra = librosa.util.normalize(modulation_spectra, axis=1)
print("normalized_modulation_spectra shape:", normalized_modulation_spectra.shape)

# Plot, using the sampling rate (22050) and hop length (512) used inside the function.
# The vertical axis is mel frequency because the rows are mel bands.
plt.figure(figsize=(10, 5))
librosa.display.specshow(normalized_modulation_spectra, sr=22050, hop_length=512, x_axis='time', y_axis='mel')
plt.colorbar(label='Normalized power')
plt.title('Modulation Spectra')
plt.xlabel('Time (s)')
plt.ylabel('Mel frequency (Hz)')

After obtaining the desired features, the next step is to feed them into a machine learning model. Depending on the objectives of the application, the task may be classification, clustering, regression, or something else. The model learns patterns and relationships from the extracted features, which enables it to make predictions. This phase typically involves training the model on labelled data, followed by evaluation on unseen data to assess its performance and ability to generalize. Tuning of model parameters and feature selection may then be used to improve performance and interpretability.
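
The train-then-evaluate workflow described above can be sketched as follows. The feature vectors are synthetic stand-ins for two hypothetical call types, and the nearest-centroid classifier is chosen only because it fits in a few lines; neither is taken from the original study.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 8-dimensional feature vectors for two hypothetical call types
features = np.vstack([rng.normal(0.0, 1.0, (100, 8)),
                      rng.normal(3.0, 1.0, (100, 8))])
labels = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out 25% of the examples as unseen test data
order = rng.permutation(len(features))
split = int(0.75 * len(order))
train_idx, test_idx = order[:split], order[split:]

# "Training": compute one centroid per class from the training examples only
centroids = np.stack([
    features[train_idx][labels[train_idx] == c].mean(axis=0) for c in (0, 1)
])

# Prediction: assign each held-out example to its nearest centroid
distances = np.linalg.norm(features[test_idx][:, None, :] - centroids[None, :, :], axis=2)
predictions = distances.argmin(axis=1)

test_accuracy = np.mean(predictions == labels[test_idx])
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Keeping the test examples out of the centroid computation is the important part: accuracy measured on data the model has already seen says little about how it will behave on new recordings.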
