The Talent500 Blog

Detecting Anomalies in Time Series with Python


Anomalies in time series data can indicate significant events, issues, or changes that require attention. Detecting these anomalies without delay is crucial for businesses to maintain smooth operations and prevent potential losses. In this blog, we will explore various techniques for detecting anomalies in time series data using Python, with a focus on practical implementation and real-world applications.

Why Your Business Needs Anomaly Detection


Anomaly detection is essential for identifying unusual patterns that could signify potential problems, such as drops in user engagement or performance issues in business operations.

  • Manually inspecting metrics is time-consuming and prone to error, especially when dealing with large volumes of data. 
  • An automated anomaly detection system provides early warnings, helping businesses stay ahead of issues and prevent revenue loss.

Consider a business scenario where an e-commerce platform experiences a sudden drop in user engagement.

  • Without an effective anomaly detection system, it might take days or weeks to identify the root cause.
  • This can lead to significant revenue loss.
  • Implementing a robust anomaly detection system ensures that such issues are identified promptly, enabling quick resolution and minimizing their impact.

Types of Anomalies and Choosing the Right Algorithm


Types of Time Series Anomalies

Point Anomalies

  • Single data points that deviate significantly from the rest of the series.
  • These are typically the most straightforward type of anomaly and the easiest to detect.
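To make this concrete, here is a minimal z-score sketch of point anomaly detection; the sample data, the `zscore_anomalies` name, and the 2.0 threshold are illustrative assumptions, not part of any library used in this post:

```python
import numpy as np

def zscore_anomalies(values, threshold=3.0):
    """Flag points whose z-score (distance from the mean, measured in
    standard deviations) exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        return np.zeros(len(values), dtype=bool)
    return np.abs((values - values.mean()) / std) > threshold

# A flat series with one spike: only the spike should be flagged
readings = [10, 11, 9, 10, 12, 10, 11, 100, 10, 9]
flags = zscore_anomalies(readings, threshold=2.0)
print(np.flatnonzero(flags))  # -> [7]
```

A plain z-score rule like this breaks down for the other anomaly types below, which is why more specialized detectors are needed.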

Contextual Anomalies

  • These anomalies are only abnormal within their context, such as a seasonal or cyclical pattern.
  • For example, an unusually high sales figure during a non-promotional period could be a contextual anomaly.
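One way to illustrate the idea (a sketch under assumed monthly seasonality, not the ADTK approach used later) is to score each value against the typical level of its own calendar month, so a value that is globally unremarkable can still be flagged in context:

```python
import numpy as np
import pandas as pd

def contextual_anomalies(series, threshold=3.0):
    """Flag values that deviate strongly from their own calendar month's
    typical level, rather than from the global mean."""
    month = series.index.month
    baseline = series.groupby(month).transform("mean")
    spread = series.groupby(month).transform("std").replace(0, np.nan)
    score = (series - baseline) / spread
    return score.abs() > threshold

# January runs near 10, February near 100; a January value of 50 sits
# between the two levels globally but is abnormal for January
idx = pd.date_range("2022-01-01", "2022-02-28", freq="D")
values = pd.Series([10.0] * 31 + [100.0] * 28, index=idx)
values.loc["2022-01-15"] = 50.0
flags = contextual_anomalies(values)
```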

Collective Anomalies

  • Sequences of data points that collectively appear abnormal, even when each individual point looks normal.
  • This type often requires more sophisticated detection techniques.
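As a rough illustration of why these are harder (a sketch only; the Isolation Forest section below shows a practical detector), one can compare each window's mean against the overall level. In the example, the value 13 occurs normally throughout the series, but a sustained run of 13s is collectively abnormal:

```python
import numpy as np
import pandas as pd

def collective_anomalies(series, window=5, threshold=1.5):
    """Flag stretches whose windowed mean drifts away from the overall
    mean, even when no single point is extreme on its own."""
    rolling = series.rolling(window, center=True).mean()
    # scale by the standard error of a window-sized mean
    score = (rolling - series.mean()) / (series.std() / np.sqrt(window))
    return score.abs() > threshold

# An oscillating series with a run of ten 13s spliced in: every point
# is an ordinary 7 or 13, but the run shifts the windowed mean
base = [7, 13] * 45
series = pd.Series(base[:45] + [13] * 10 + base[45:], dtype=float)
flags = collective_anomalies(series)
```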

Choosing the Right Algorithm

Selecting the appropriate anomaly detection algorithm depends on the type of anomaly you want to detect. Here, we will focus on two:

Level Shift Anomalies: Sudden changes in data levels, which are crucial for detecting abrupt shifts in metrics.

Collective Anomalies: Continuous abnormal patterns, essential for monitoring sustained deviations.

Level Shift Anomaly Detection

  • Level shift anomalies are sudden changes in the baseline level of a time series.
  • Detecting them helps identify sudden drops or increases in metrics and enables timely alerts.

ADTK Library

The ADTK (Anomaly Detection Toolkit) library in Python provides a function for level shift detection. It uses two sliding windows to calculate differences in statistical measures and identify anomalies.

 

python

from adtk.data import validate_series
from adtk.detector import LevelShiftAD
import pandas as pd
import plotly.graph_objects as go

# Load and preprocess data
data = pd.read_csv('time_series_data.csv')
data.index = pd.to_datetime(data['date'])
data.drop(columns=['date'], inplace=True)
data = validate_series(data)

# Level shift detection: c controls sensitivity, window sets the size
# of the two sliding windows
model = LevelShiftAD(c=1.5, side='both', window=5)
anomalies = model.fit_detect(data)

# Plotting the results
def plot_anomalies(data, anomalies, y_column):
    # fit_detect returns True/NaN flags; convert them to a boolean mask
    mask = anomalies[y_column].fillna(False).astype(bool)
    points = data[y_column][mask]
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=data.index, y=data[y_column], mode='lines', name='Value'))
    fig.add_trace(go.Scatter(x=points.index, y=points, mode='markers', name='Anomaly', marker=dict(color='red')))
    fig.show()

plot_anomalies(data, anomalies, 'value')

Isolation Forest Method

Isolation Forest is an ensemble method that isolates anomalies by randomly partitioning the data. Anomalies, being few and different, require fewer partitions to isolate.

Implementation

Isolation Forest can detect collective anomalies efficiently and works well with high-dimensional data.

python

import pandas as pd
import plotly.graph_objects as go
from sklearn.ensemble import IsolationForest

# Load and preprocess data
data = pd.read_csv('time_series_data.csv')
data.index = pd.to_datetime(data['date'])
data.drop(columns=['date'], inplace=True)

# Isolation Forest detection: contamination is the expected share of anomalies
model = IsolationForest(contamination=0.1, random_state=42)
data['anomaly'] = model.fit_predict(data)

# Plotting the results (fit_predict labels anomalies as -1)
def plot_isolation_forest(data, y_column):
    anomalies = data[data['anomaly'] == -1]
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=data.index, y=data[y_column], mode='lines', name='Value'))
    fig.add_trace(go.Scatter(x=anomalies.index, y=anomalies[y_column], mode='markers', name='Anomaly', marker=dict(color='orange')))
    fig.show()

plot_isolation_forest(data, 'value')

Implementation Details in Python

Data Loading and Preprocessing

 

Ensure data is properly formatted, with datetime indices and appropriate data types.

 

python

import pandas as pd

def load_data(filename, x_column, y_column):
    df = pd.read_csv(filename)
    df.index = pd.to_datetime(df[x_column])
    df.drop(columns=[x_column], inplace=True)
    df[y_column] = df[y_column].astype(float)
    return df

data = load_data('time_series_data.csv', 'date', 'value')

Visualization

Visualizing anomalies helps in interpreting the results.

 

python

import plotly.graph_objects as go

def plot_anomalies(df, y_column, anomalies):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df.index, y=df[y_column], mode='lines', name='Value'))
    fig.add_trace(go.Scatter(x=anomalies.index, y=anomalies[y_column], mode='markers', name='Anomaly', marker=dict(color='red')))
    fig.show()

plot_anomalies(data, 'value', data[data['anomaly'] == -1])

Real-World Example: Detecting DAU Anomalies

Problem Encountered

In November 2022, a company noticed a significant drop in the number of daily active users (DAU). This coincided with the release of a new app version, leading to a bug that affected user updates.

Implementation

We used level shift detection and isolation forest methods to identify and alert on anomalies.

python

# Thin wrappers around the two methods shown earlier; each returns a
# boolean Series aligned with the input index
def level_shift_anomaly(df, column='DAU'):
    series = validate_series(df[column])
    model = LevelShiftAD(c=1.5, side='both', window=5)
    return model.fit_detect(series).fillna(False).astype(bool)

def isolation_forest(df, column='DAU'):
    model = IsolationForest(contamination=0.1, random_state=42)
    return pd.Series(model.fit_predict(df[[column]]) == -1, index=df.index)

# Load DAU data
dau_data = load_data('dau.csv', 'Date', 'DAU')

# Level Shift Detection
dau_data['level_shift_anomaly'] = level_shift_anomaly(dau_data)

# Isolation Forest Detection
dau_data['isolation_forest_anomaly'] = isolation_forest(dau_data)

# Plot combined results
def plot_combined_anomalies(df, y_column, level_shift_anomalies, isolation_forest_anomalies):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df.index, y=df[y_column], mode='lines', name='Value'))
    fig.add_trace(go.Scatter(x=level_shift_anomalies.index, y=level_shift_anomalies[y_column], mode='markers', name='Level Shift Anomaly', marker=dict(color='blue')))
    fig.add_trace(go.Scatter(x=isolation_forest_anomalies.index, y=isolation_forest_anomalies[y_column], mode='markers', name='Isolation Forest Anomaly', marker=dict(color='green')))
    fig.show()

plot_combined_anomalies(dau_data, 'DAU', dau_data[dau_data['level_shift_anomaly']], dau_data[dau_data['isolation_forest_anomaly']])

Combining Detection Methods


By combining multiple anomaly detection methods, we can improve the robustness and accuracy of our anomaly detection system. Each method has its strengths and weaknesses. Using them together can provide a better detection capability.

Example: Combining Level Shift and Isolation Forest

In the previous example, we used both level shift detection and isolation forest methods. This combination allowed us to detect sudden changes in data levels and continuous abnormal patterns effectively.

python

# Combining anomalies from both methods
dau_data['combined_anomaly'] = dau_data['level_shift_anomaly'] | dau_data['isolation_forest_anomaly']

# Plot combined results
def plot_final_anomalies(df, y_column):
    anomalies = df[df['combined_anomaly']]
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df.index, y=df[y_column], mode='lines', name='Value'))
    fig.add_trace(go.Scatter(x=anomalies.index, y=anomalies[y_column], mode='markers', name='Anomaly', marker=dict(color='red')))
    fig.show()

plot_final_anomalies(dau_data, 'DAU')

Conclusion

Detecting anomalies in time series data is crucial for maintaining business performance and preventing potential issues. By using Python libraries like ADTK and machine learning algorithms like Isolation Forest, we can efficiently identify and address anomalies. This comprehensive approach ensures timely alerts and enables businesses to take proactive measures, ultimately safeguarding user experience and revenue.

Taniya Pan