Debate Topic-wise Summary Pipeline with Indexify and Mistral
In this cookbook, we'll explore how to create a debate topic-wise summary pipeline using Indexify and Mistral's large language models. By the end of this document, you'll have a pipeline capable of processing video debates, extracting audio, performing speech recognition and diarization, and generating summaries for each topic discussed.
Table of Contents
- Introduction
- Prerequisites
- Setup
  - Install Indexify
  - Install Required Extractors
- Creating the Extraction Graph
- Implementing the Debate Summary Pipeline
- Running the Summary Pipeline
- Customization and Advanced Usage
- Conclusion
Introduction
The debate summary pipeline will consist of four main steps:
1. Video to Audio extraction using tensorlake/audio-extractor
2. Speech recognition and diarization using tensorlake/asrdiarization
3. Topic extraction using tensorlake/mistral
4. Topic-wise summarization using tensorlake/mistral
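The four steps form a small dependency chain in which each stage consumes the output of an earlier one. As a rough illustration only, the sketch below models that chain in plain Python. It is not Indexify's API: the stage names mirror the policy names used in the extraction graph in this cookbook, and `PIPELINE` and `execution_order` are hypothetical names invented for this sketch.

```python
# Illustrative only: models the pipeline stages as a dependency map.
# Each stage maps to the stage whose output it consumes
# (None means the stage reads the uploaded video directly).
PIPELINE = {
    "video_to_audio": None,
    "speech_recognition": "video_to_audio",
    "topic_extraction": "speech_recognition",
    "topic_summarization": "speech_recognition",
}


def execution_order(pipeline):
    """Return stage names ordered so that every stage follows its source."""
    ordered = []
    remaining = dict(pipeline)
    while remaining:
        # A stage is ready once its source has already been scheduled.
        ready = [name for name, source in remaining.items()
                 if source is None or source in ordered]
        if not ready:
            raise ValueError(f"unresolvable sources: {sorted(remaining)}")
        for name in ready:
            ordered.append(name)
            del remaining[name]
    return ordered


print(execution_order(PIPELINE))
```

Both Mistral stages read the transcript produced by speech recognition, so neither depends on the other.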
Prerequisites
Before we begin, ensure you have the following:
- A virtual environment with Python 3.9 or later
- pip (Python package manager)
- A Mistral API key
- Basic familiarity with Python and command-line interfaces
Setup
Install Indexify
First, let's install Indexify using the official installation script:
Start the Indexify server:
Install Required Extractors
Next, we'll install the necessary extractors in a new terminal:
pip install indexify-extractor-sdk
indexify-extractor download tensorlake/audio-extractor
indexify-extractor download tensorlake/asrdiarization
indexify-extractor download tensorlake/mistral
Once the extractors are downloaded, start them:
Creating the Extraction Graph
Create a new Python file called debate_summary_graph.py and add the following code:
from indexify import IndexifyClient, ExtractionGraph

client = IndexifyClient()

extraction_graph_spec = """
name: 'debate_summarizer'
extraction_policies:
  - extractor: 'tensorlake/audio-extractor'
    name: 'video_to_audio'
  - extractor: 'tensorlake/asrdiarization'
    name: 'speech_recognition'
    content_source: 'video_to_audio'
  - extractor: 'tensorlake/mistral'
    name: 'topic_extraction'
    input_params:
      model_name: 'mistral-large-latest'
      key: 'YOUR_MISTRAL_API_KEY'
      system_prompt: 'Extract the main topics discussed in this debate transcript. List each topic as a brief phrase or title.'
    content_source: 'speech_recognition'
  - extractor: 'tensorlake/mistral'
    name: 'topic_summarization'
    input_params:
      model_name: 'mistral-large-latest'
      key: 'YOUR_MISTRAL_API_KEY'
      system_prompt: 'Summarize the discussion on the main topics from the debate transcript. Provide key points and arguments from both sides.'
    content_source: 'speech_recognition'
"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
Replace 'YOUR_MISTRAL_API_KEY' with your actual Mistral API key in both Mistral policies.
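If you would rather not paste the key into the source file, one option is to substitute the placeholder at runtime. The following is a minimal sketch of that idea, not part of Indexify: `inject_api_key` is a hypothetical helper, and `MISTRAL_API_KEY` is an environment variable name assumed for this example.

```python
import os

PLACEHOLDER = "YOUR_MISTRAL_API_KEY"


def inject_api_key(spec, api_key):
    """Return the graph spec with every placeholder replaced by the key."""
    if not api_key:
        raise ValueError("Mistral API key is empty")
    if PLACEHOLDER not in spec:
        raise ValueError("spec contains no placeholder to replace")
    return spec.replace(PLACEHOLDER, api_key)


# Example with a shortened spec fragment; "demo-key" is a stand-in value
# used only when the (assumed) MISTRAL_API_KEY variable is unset or empty.
fragment = "key: 'YOUR_MISTRAL_API_KEY'"
print(inject_api_key(fragment, os.environ.get("MISTRAL_API_KEY") or "demo-key"))
```

The same call could be applied to the full spec string before it is passed to `ExtractionGraph.from_yaml`.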
Run debate_summary_graph.py to set up the pipeline:
python debate_summary_graph.py
Implementing the Debate Summary Pipeline
Now let's create a script to upload the video and retrieve the summaries. Create a file named upload_and_retreive.py:
from indexify import IndexifyClient


def summarize_debate(video_path):
    client = IndexifyClient()

    # Upload the video file to the extraction graph
    content_id = client.upload_file("debate_summarizer", video_path)

    # Wait for the extraction to complete
    client.wait_for_extraction(content_id)

    # Retrieve the extracted topics
    topics = client.get_extracted_content(
        content_id=content_id,
        graph_name="debate_summarizer",
        policy_name="topic_extraction"
    )
    topics = topics[0]['content'].decode('utf-8')

    # Retrieve the topic summaries
    summaries = client.get_extracted_content(
        content_id=content_id,
        graph_name="debate_summarizer",
        policy_name="topic_summarization"
    )
    summaries = summaries[0]['content'].decode('utf-8')

    return topics, summaries


# Example usage
if __name__ == "__main__":
    video_path = "biden_trump_debate_2024.mp4"
    topics, summaries = summarize_debate(video_path)
    print("Debate Topics and Summaries:")
    print(topics, summaries)
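The script above returns the topics as one block of text and indexes the first result without checking that anything came back. As a hedged sketch, the helpers below show one way to guard against an empty result and to split the topic text into a list. `first_text` and `parse_topics` are hypothetical names, the sample data is made up, and the parsing assumes the model lists one topic per line, which depends on the prompt and is not guaranteed.

```python
import re


def first_text(results):
    """Decode the first extracted item, or return '' if nothing came back."""
    if not results:
        return ""
    content = results[0]["content"]
    return content.decode("utf-8") if isinstance(content, bytes) else str(content)


def parse_topics(text):
    """Split model output into topics, dropping bullets and list numbering."""
    topics = []
    for line in text.splitlines():
        # Strip a leading "-", "*", "1." or "1)" marker if present.
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if cleaned:
            topics.append(cleaned)
    return topics


# Made-up sample shaped like the list of dicts the script indexes into.
sample = [{"content": b"1. Economy\n2. Immigration\n- Foreign policy\n"}]
print(parse_topics(first_text(sample)))
```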
Running the Summary Pipeline
To run the debate summary pipeline:
- Ensure you have the video file of the Biden-Trump 2024 Presidential Debate saved as biden_trump_debate_2024.mp4 in the same directory as your script.
- Run the Python script:
  python upload_and_retreive.py
This will process the video, extract topics, and generate summaries for each topic discussed in the debate.
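To process several recordings in one run, the per-video function can be wrapped in a small loop. This is a sketch under stated assumptions rather than an Indexify feature: `summarize_all` is a hypothetical helper, and the summarizing function is passed in as an argument so that `summarize_debate` from the script above (or a stub, as here) can be used.

```python
from pathlib import Path


def summarize_all(video_paths, summarize):
    """Run `summarize` on each existing video; collect results and failures."""
    results, failures = {}, {}
    for path in map(Path, video_paths):
        if not path.is_file():
            failures[path.name] = "file not found"
            continue
        try:
            results[path.name] = summarize(str(path))
        except Exception as exc:  # keep going if one video fails
            failures[path.name] = str(exc)
    return results, failures


# Stub standing in for summarize_debate so the sketch runs without a server.
def fake_summarize(path):
    return ("topics for " + path, "summaries for " + path)


results, failures = summarize_all(["missing_video.mp4"], fake_summarize)
print(results, failures)
```

Results and failures are keyed by file name, so videos in different directories that share a name would overwrite each other; key by full path if that matters for your layout.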
Customization and Advanced Usage
You can customize the summarization process by modifying the system_prompt in the extraction graph. For example:
- To focus on specific aspects of the debate:
- To generate more concise summaries:
You can also experiment with different Mistral models by changing the model_name parameter to find the best balance between speed and accuracy for your specific use case.
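When trying several prompts or models, it can help to generate the policy entry instead of editing the YAML by hand. The sketch below renders one `tensorlake/mistral` policy in the same shape as the graph defined earlier. `mistral_policy` and `yaml_single_quoted` are hypothetical helpers, and the prompt text is invented for this example rather than taken from the cookbook.

```python
def yaml_single_quoted(text):
    """Quote text as a YAML single-quoted scalar (' is escaped as '')."""
    return "'" + text.replace("'", "''") + "'"


def mistral_policy(name, system_prompt, model_name="mistral-large-latest",
                   content_source="speech_recognition"):
    """Render one tensorlake/mistral policy entry for the extraction graph."""
    lines = [
        "  - extractor: 'tensorlake/mistral'",
        f"    name: {yaml_single_quoted(name)}",
        "    input_params:",
        f"      model_name: {yaml_single_quoted(model_name)}",
        "      key: 'YOUR_MISTRAL_API_KEY'",
        f"      system_prompt: {yaml_single_quoted(system_prompt)}",
        f"    content_source: {yaml_single_quoted(content_source)}",
    ]
    return "\n".join(lines)


# Hypothetical prompt, written for this sketch only.
print(mistral_policy(
    "topic_summarization",
    "Summarize each topic in at most two sentences, noting each speaker's position.",
))
```

Doubling the apostrophe keeps prompts such as "speaker's position" valid inside single-quoted YAML. Whether a given model name is accepted depends on the extractor and your Mistral account, so treat any value other than the one used in this cookbook as something to verify.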
Conclusion
This debate topic-wise summary pipeline demonstrates the power and flexibility of Indexify for complex, multi-step processing tasks. Key advantages include:
- Scalability: Indexify can handle large video files and process multiple debates efficiently.
- Modularity: Each step in the pipeline (audio extraction, speech recognition, topic extraction, summarization) is separate, allowing for easy customization and improvement.
- Error Handling: Indexify automatically retries failed steps, ensuring robustness in processing.
Next Steps
- Explore more Indexify features in the official documentation
- Learn about other use cases, such as entity extraction from documents
- Experiment with different extractors and language models to improve the accuracy and depth of your debate summaries