Building Multi-Modal Models for Content Moderation on Social Media

By Ayushi Trivedi | Last updated: 19 Sep, 2024

Introduction

Imagine you’re scrolling through your favorite social media platform when, out of nowhere, an offensive post pops up. Before you can even hit the report button, it’s gone. That’s content moderation in action. Behind the scenes, platforms rely on sophisticated algorithms to keep harmful content at bay, and the rapid growth of artificial intelligence is transforming how it’s done. In this article, we’ll explore the world of content moderation, from how industries use it to safeguard their communities, to the AI-driven tools that make it scalable. We’ll dive into the differences between heuristic and AI-based methods, and even guide you through building your own AI-powered multimodal classifier for moderating complex content like audio and video. Let’s get started!

This article is based on a recent talk given by Pulkit Khandelwal on Building Multi-Modal Models for Content Moderation on Social Media at the DataHack Summit 2024.

Learning Outcomes

  • Understand the key role content moderation plays in maintaining safe online environments.
  • Differentiate between heuristic and AI-based approaches to content moderation.
  • Learn how AI performs feature extraction and how content spanning multiple modalities is classified.
  • Develop practical skills for building a multimodal classifier using several pre-trained models.
  • Understand the risks and the future potential of AI-driven content moderation.

What is Content Moderation and Why Is It Important?

Content moderation is the process of reviewing, filtering, and assessing user-generated content to remove undesirable material according to specific legal and community standards. As the internet grows and people post to social media, video hosting sites, forums, and other platforms, enormous volumes of material are uploaded every minute. Moderation is essential for protecting users from dangerous, obscene, or false information, including hate speech, violence, and fake news.

Moderation therefore plays an important role in keeping social network users safe and in building trustworthy interactions. It also helps platforms avoid scandals, maintain their reliability, comply with legal requirements, and reduce the likelihood of reputational damage. Effective moderation sustains positive discourse in online communities, making it a key success factor for businesses across industries such as social media, e-commerce, and gaming.


Industry Use Cases of Content Moderation

Various industries rely on content moderation to protect their users:

  • Social Media: Companies such as Facebook and Twitter use moderation to block hate speech, violent content, and fake news.
  • E-commerce: Marketplaces such as eBay and Amazon use moderation to keep product listings legal and appropriate for the community.
  • Streaming Services: Platforms like YouTube moderate videos for copyright infringement and indecent material.
  • Gaming: Multiplayer games employ moderation to prevent harassment and unhealthy interactions in chat.
  • Job Portals: Screen out spam, fake profiles, unregistered users, and job postings that are fraudulent or irrelevant to candidates' skills.

Implications of Bad Speech

The consequences of harmful or offensive content, often referred to as “bad speech,” are vast and multi-dimensional. Psychologically, it can cause emotional distress, lead to mental health issues, and contribute to societal harm. The unchecked spread of misinformation can incite violence, while platforms face legal and regulatory repercussions for non-compliance. Economically, bad speech can degrade content quality, leading to brand damage, user attrition, and increased scrutiny from authorities. Platforms are also ethically responsible for balancing free speech with user safety, making content moderation a critical yet challenging task.


Heuristic vs. AI-Based Approaches to Content Moderation

Content moderation started with heuristic-based methods, which rely on rules and manual moderation. While effective to some extent, these methods are limited in scale and adaptability, especially when dealing with massive volumes of content.

In contrast, AI-based approaches leverage machine learning models to automatically analyze and classify content, enabling greater scalability and speed. These models can detect patterns, classify text, images, videos, and audio, and even handle different languages. The introduction of multimodal AI has further improved the ability to moderate complex content types more accurately.


Leveraging AI in Content Moderation

In today’s digital landscape, AI plays a pivotal role in enhancing content moderation processes, making them more efficient and scalable. Here’s how AI is revolutionizing content moderation:

Feature Extraction Using AI

Machine learning models can recognize important features in content such as text, images, and video. They identify keywords, phrases, visual patterns, color distributions, and sounds that are essential for classification. For instance, natural language processing techniques parse and interpret text, while computer vision models evaluate images and videos for violations of platform standards.


Pre-trained Models for Content Embeddings

AI leverages pre-trained models to generate embeddings, which are vector representations of content that capture semantic meaning. These embeddings help in comparing and analyzing content across different modalities. For instance, models like BERT and GPT for text, or CLIP for images, can be used to understand context and detect harmful content based on pre-learned patterns.
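As an illustration, here is a minimal sketch of producing sentence-level embeddings with a pre-trained BERT model through the Hugging Face transformers library. The model name ("bert-base-uncased") and the mean-pooling step are illustrative choices, not requirements from the article.

```python
# Minimal sketch: sentence embeddings from a pre-trained BERT model.
# Assumes the Hugging Face `transformers` library is installed; the model
# choice and mean pooling are illustrative, not the only options.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

texts = ["Great photo from the trip!", "I will find you and hurt you."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# Mean-pool token embeddings into one 768-dim vector per text.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```

Similar embedding vectors can be produced for images or audio with the modality-specific models discussed below, and then compared or classified downstream.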

Multimodal Modeling Approaches

AI enhances content moderation by integrating multiple data types, such as text, images, and audio, through multimodal models. These models can simultaneously process and analyze different content forms, providing a more comprehensive understanding of context and intent. For example, a multimodal model might analyze a video by evaluating both the visual content and accompanying audio to detect inappropriate behavior or speech.


I3D – Inflated 3D ConvNet

I3D (Inflated 3D ConvNet), introduced by Google researchers in 2017, is a powerful model designed for video analysis. It expands on the traditional 2D ConvNets by inflating them into 3D, allowing for more nuanced understanding of temporal information in videos. This model has proven effective in accurately recognizing a diverse range of actions and behaviors, making it particularly valuable for content moderation in video contexts.
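To make the idea concrete, here is a minimal sketch of extracting clip-level video features with a pre-trained 3D ConvNet. torchvision does not ship I3D itself, so r3d_18 (a 3D ResNet trained on Kinetics-400) stands in here purely to illustrate how a video clip becomes a feature vector; the clip tensor is a random placeholder.

```python
# Minimal sketch: clip-level video features from a pre-trained 3D ConvNet.
# r3d_18 is used as a stand-in for I3D, which torchvision does not provide.
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

model = r3d_18(weights=R3D_18_Weights.DEFAULT)
model.fc = torch.nn.Identity()   # drop the classifier, keep the 512-dim features
model.eval()

# A dummy clip: (batch, channels, frames, height, width)
clip = torch.randn(1, 3, 16, 112, 112)

with torch.no_grad():
    features = model(clip)
print(features.shape)  # torch.Size([1, 512])
```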

Key Applications

  • Surveillance: Enhances security footage analysis by detecting and recognizing specific actions, improving the ability to identify harmful or inappropriate content.
  • Sports Analytics: Analyzes player movements and actions in sports videos, offering detailed insights into gameplay and performance.
  • Entertainment: Improves content understanding and moderation in entertainment videos by distinguishing between appropriate and inappropriate actions based on context.

Related Video Modeling Approaches

  • LSTM: Recurrent networks such as Long Short-Term Memory (LSTM) handle sequential data, complementing 3D ConvNets by processing temporal sequences in video.
  • 3D ConvNet: Traditional 3D Convolutional Networks focus on spatiotemporal feature extraction; I3D builds on them by inflating existing 2D networks into a 3D framework.
  • Two-Stream Networks: These networks combine spatial and temporal information from videos and are often integrated with I3D for enhanced performance.
  • 3D-Fused Two-Stream Networks: These models fuse information from multiple streams to improve action recognition accuracy.
  • Two-Stream 3D ConvNet: Combines the strengths of both two-stream and 3D ConvNet approaches for a more comprehensive analysis of video content.

VGGish: Adapting VGG Architecture for Advanced Audio Classification

VGGish is a specialized variation of the VGG network architecture, adapted for audio classification tasks. Introduced by Google researchers, VGGish leverages the well-established VGG architecture, originally designed for image classification, and modifies it to process audio data effectively.

How It Works

  • Architecture: VGGish utilizes a convolutional neural network (CNN) model based on VGG, specifically designed to handle audio spectrograms. This adaptation involves using VGG’s layers and structure but tailored to extract meaningful features from audio signals rather than images.
  • Layer Configuration: It consists of multiple convolutional layers with 3 × 3 receptive fields and 1 × 1 stride, interleaved with max-pooling layers with 2 × 2 receptive fields and 2 × 2 stride. These are followed by global average pooling to reduce dimensionality, fully connected layers, dropout layers to limit overfitting, and a softmax layer that produces the prediction.
  • Feature Extraction: Audio is first converted into spectrograms, image-like representations of how energy is distributed across frequencies over time. VGGish can then operate like any other CNN, analyzing these spectrograms to recognize different audio events (a sketch of this spectrogram step follows below).
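Below is a minimal sketch of turning raw audio into a log-mel spectrogram, the kind of input a VGGish-style CNN consumes. The parameters roughly follow the VGGish front end (16 kHz mono audio, 64 mel bands, a 25 ms window, and a 10 ms hop); the file path is a placeholder and the exact logarithm/offset details of the official pipeline differ slightly.

```python
# Minimal sketch: raw audio -> log-mel spectrogram for a VGGish-style CNN.
# The audio path is a placeholder; parameters approximate the VGGish front end.
import librosa

y, sr = librosa.load("clip.wav", sr=16000, mono=True)

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=400,        # ~25 ms window at 16 kHz
    hop_length=160,   # ~10 ms hop
    n_mels=64,
)
log_mel = librosa.power_to_db(mel)   # shape: (64, num_frames)
print(log_mel.shape)
```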

Applications

  • Audio Event Detection: Recognizes audio events in varied environments, including urban soundscapes, improving the ability to identify individual sounds within a complex scene.
  • Speech Recognition: Improves speech recognition systems by helping distinguish between different spoken words and phrases in a given language.
  • Music Genre Classification: Supports categorization of music genres based on acoustic qualities, enabling easier grouping and searching of music content.

Hands-on: Building a Multimodal Classifier

Building a multimodal classifier involves integrating various data types, including audio, video, text, and images. This approach improves classification accuracy and robustness. This section walks through the essential steps and concepts for developing a multimodal classifier.

Overview of the Process


Understanding the Multimodal Approach

Multimodal classification works much like single-modality classification, except that the model draws on information from several inputs to make its predictions. The main objective is to exploit the synergies between modalities to optimize the overall model's performance.

Data Preparation

  • Audio and Video: Gather or pull your audio and/or video data. For audio, create spectrograms and derive feature vectors from them. For video, extract frames first and then use CNNs for feature extraction (a frame-sampling sketch follows this list).
  • Text and Images: For textual data, start with tokenization and then embed the tokens for further processing. For images, normalize them first and then use pre-trained CNN models for feature extraction.
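Here is a minimal sketch of sampling frames from a video with OpenCV so they can later be fed to a CNN. The file path and the one-frame-per-second sampling rate are placeholders chosen for illustration.

```python
# Minimal sketch: sample roughly one frame per second from a video with OpenCV.
# The path and sampling rate are placeholders.
import cv2

cap = cv2.VideoCapture("post_video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30   # fall back if FPS metadata is missing

frames, idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % int(fps) == 0:
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    idx += 1
cap.release()
print(f"Sampled {len(frames)} frames")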

Feature Extraction

  • Audio Features: Utilize models like VGGish to extract relevant features from audio spectrograms.
  • Video Features: Apply 3D Convolutional Networks (e.g., I3D) to capture temporal dynamics in video data.
  • Text Features: Use pre-trained language models like BERT or GPT to obtain contextual embeddings.
  • Image Features: Extract features using CNN architectures such as ResNet or VGG.
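The text, video, and audio steps were sketched in earlier sections; for completeness, here is a minimal sketch of extracting image features with a pre-trained ResNet from torchvision. The preprocessing pipeline, model choice, and image path are illustrative assumptions.

```python
# Minimal sketch: image feature vectors from a pre-trained ResNet-50.
# Model choice, preprocessing, and the image path are illustrative.
import torch
from torchvision import models, transforms
from PIL import Image

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()   # keep the 2048-dim pooled features
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("post_image.jpg").convert("RGB")   # placeholder path
with torch.no_grad():
    img_features = resnet(preprocess(img).unsqueeze(0))
print(img_features.shape)  # torch.Size([1, 2048])
```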

Annotations

  • Include multi-label annotations for your dataset, which help in categorizing each data point according to multiple classes.
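One common way to encode such multi-label annotations is as multi-hot vectors, sketched below. The label vocabulary here is made up purely for illustration.

```python
# Minimal sketch: multi-label annotations as multi-hot vectors.
# The label vocabulary is illustrative, not taken from any real dataset.
import torch

LABELS = ["hate_speech", "violence", "nudity", "spam"]

def to_multi_hot(tags):
    vec = torch.zeros(len(LABELS))
    for t in tags:
        vec[LABELS.index(t)] = 1.0
    return vec

print(to_multi_hot(["violence", "spam"]))  # tensor([0., 1., 0., 1.])
```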

Preprocessing

  • Temporal Padding: Adjust the length of sequences to ensure consistency across different inputs.
  • Datatype Conversion: Convert data into formats suitable for model training, such as normalizing images or converting audio to spectrograms.
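Temporal padding can be as simple as padding or truncating per-frame feature sequences to a fixed length so every example in a batch has the same temporal size, as in this minimal sketch (the target length and feature dimension are illustrative).

```python
# Minimal sketch: pad or truncate a (time, feature_dim) sequence to a fixed length.
import torch

def pad_or_truncate(seq, target_len):
    """seq: tensor of shape (time, feature_dim)."""
    t, d = seq.shape
    if t >= target_len:
        return seq[:target_len]
    padding = torch.zeros(target_len - t, d)
    return torch.cat([seq, padding], dim=0)

video_feats = torch.randn(37, 512)             # e.g. 37 sampled frames
print(pad_or_truncate(video_feats, 64).shape)   # torch.Size([64, 512])
```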

Model Fusion

  • Feature Concatenation: Combine features from different modalities into a unified feature vector.
  • Model Architecture: Implement a neural network architecture that can process the fused features. This could be a fully connected network or a more complex architecture depending on the specific use case.
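The sketch below shows one possible fusion architecture: each modality's feature vector is projected, concatenated into a unified vector, and passed through a small fully connected head. All dimensions and the number of labels are illustrative assumptions, not values prescribed by the article.

```python
# Minimal sketch: feature concatenation plus a fully connected fusion head.
# Dimensions and label count are illustrative.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, text_dim=768, num_labels=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, 128)
        self.video_proj = nn.Linear(video_dim, 128)
        self.text_proj = nn.Linear(text_dim, 128)
        self.head = nn.Sequential(
            nn.Linear(128 * 3, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_labels),   # raw logits, one per label
        )

    def forward(self, audio, video, text):
        fused = torch.cat([
            torch.relu(self.audio_proj(audio)),
            torch.relu(self.video_proj(video)),
            torch.relu(self.text_proj(text)),
        ], dim=-1)
        return self.head(fused)

model = FusionClassifier()
logits = model(torch.randn(2, 128), torch.randn(2, 512), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 4])
```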

Training and Evaluation

  • Training: Train your multimodal model using labeled data and appropriate loss functions.
  • Evaluation: Assess the model’s performance using metrics like accuracy, precision, recall, and F1 score.
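For a multi-label setup like the one sketched above, a binary cross-entropy loss over the label logits and threshold-based metrics are one reasonable choice. The sketch below uses a tiny stand-in model over already-fused features together with random placeholder data; everything about it (dimensions, threshold, learning rate) is an illustrative assumption.

```python
# Minimal sketch: one training step and evaluation for a multi-label classifier
# over fused features. All data here is random placeholder data.
import torch
import torch.nn as nn
from sklearn.metrics import f1_score, precision_score, recall_score

model = nn.Sequential(nn.Linear(384, 256), nn.ReLU(), nn.Linear(256, 4))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

features = torch.randn(8, 384)                  # placeholder fused features
labels = torch.randint(0, 2, (8, 4)).float()    # placeholder multi-hot labels

# One illustrative training step.
optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()

# Evaluation: threshold sigmoid outputs and compute multi-label metrics.
with torch.no_grad():
    preds = (torch.sigmoid(model(features)) > 0.5).int().numpy()
y_true = labels.int().numpy()
print("precision:", precision_score(y_true, preds, average="micro", zero_division=0))
print("recall:   ", recall_score(y_true, preds, average="micro", zero_division=0))
print("F1:       ", f1_score(y_true, preds, average="micro", zero_division=0))
```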

Extending to Other Modalities

  • Text and Image Integration: Incorporate text and image data by following similar preprocessing and feature extraction steps as described for audio and video.
  • Adaptation: Modify the model architecture as needed to handle the additional modalities and ensure proper fusion of features.

Conclusion

Developing multi-modal models for content moderation makes online platforms safer. These systems integrate text, audio, and video data into one unified model, which helps distinguish acceptable from unacceptable content. Combining these approaches improves the credibility of content moderation and addresses the nuances of different interactions and content challenges. As social media evolves, multi-modal moderation will need to advance as well, upholding community values and guarding against the negative impacts of modern internet communication.

Frequently Asked Questions

Q1. Can multi-modal models handle live video moderation?

A. Multi-modal models are not typically designed for real-time live video moderation due to the computational complexity, but advancements in technology may improve their capabilities in this area.

Q2. Are multi-modal models suitable for small-scale platforms?

A. Yes, multi-modal models can be scaled to fit various platform sizes, including small-scale ones, though the complexity and resource requirements may vary.

Q3. How do multi-modal models improve content moderation accuracy?

A. They enhance accuracy by analyzing multiple types of data (text, audio, video) simultaneously, which provides a more comprehensive understanding of the content.

Q4. Can these models be used for languages other than English?

A. Yes, multi-modal models can be trained to handle multiple languages, provided they are supplied with appropriate training data for each language.

Q5. What are the main challenges in building multi-modal content moderation systems?

A. Key challenges include handling diverse data types, ensuring model accuracy, managing computational resources, and maintaining system scalability.

