Multimodal AI Assistant with Memory and Speech

Multimodal Input Support: Accepts images, video, text, and audio inputs, allowing for versatile applications such as visual question answering and speech recognition.

Real-Time Speech Interaction: Supports bilingual real-time speech conversations with configurable voices, including emotion, speed, and style control, as well as end-to-end voice cloning and role play.
import streamlit as st
import numpy as np
import cv2
from pydub import AudioSegment
import speech_recognition as sr
st.title('Multimodal AI Assistant')
# Image upload
uploaded_image = st.file_uploader('Upload an Image', type=['jpg', 'jpeg', 'png'])
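if uploaded_image is not None:
    # Decode the uploaded bytes into a BGR array; cv2.imdecode returns None on failure
    image_bytes = np.frombuffer(uploaded_image.read(), np.uint8)
    image_data = cv2.imdecode(image_bytes, cv2.IMREAD_COLOR)
    if image_data is None:
        st.error('Could not decode the uploaded image')
    else:
        st.image(image_data, caption='Uploaded Image', channels='BGR')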
# Video upload
uploaded_video = st.file_uploader('Upload a Video', type=['mp4', 'mov', 'avi'])
if uploaded_video is not None:
    st.video(uploaded_video)
# Text input
user_input_text = st.text_input('Enter text input:')
if user_input_text:
    st.write('You entered:', user_input_text)
# Audio input
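# Streamlit 1.40 and later expose this widget as st.audio_input;
# Streamlit 1.39 used the name st.experimental_audio_input
audio_input = st.audio_input('Record a voice message')
if audio_input:
    st.audio(audio_input)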
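# Speech Recognition
if audio_input:
    recognizer = sr.Recognizer()
    # Rewind in case the recording was already read while rendering the player
    audio_input.seek(0)
    # The recording is a file-like object holding WAV data, which sr.AudioFile accepts
    with sr.AudioFile(audio_input) as source:
        audio_data = recognizer.record(source)
    try:
        # recognize_google sends the audio to Google's Web Speech API (network required)
        text = recognizer.recognize_google(audio_data)
        st.write('Recognized Speech:', text)
    except sr.RequestError:
        st.error('API unavailable')
    except sr.UnknownValueError:
        st.error('Could not understand audio')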
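
The title mentions memory, but the script above does not keep any state between inputs. The following is a minimal sketch of one way to add it, not the app's actual design: the names `Turn` and `ConversationMemory` and the `max_turns` limit are hypothetical and chosen only for illustration. It uses the standard library only, so it can be tried outside Streamlit.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Turn:
    """One remembered input (hypothetical structure, not part of the original app)."""
    role: str      # who produced it, e.g. "user" or "assistant"
    modality: str  # how it arrived, e.g. "text", "speech", "image"
    content: str   # the text, transcript, or a short description


class ConversationMemory:
    """Bounded memory that keeps only the most recent `max_turns` turns."""

    def __init__(self, max_turns: int = 20) -> None:
        # A deque with maxlen silently discards the oldest turn when full
        self._turns = deque(maxlen=max_turns)

    def add(self, role: str, modality: str, content: str) -> None:
        self._turns.append(Turn(role, modality, content))

    def as_context(self) -> str:
        """Render the remembered turns as one line each, oldest first."""
        return "\n".join(
            f"{t.role} ({t.modality}): {t.content}" for t in self._turns
        )

    def __len__(self) -> int:
        return len(self._turns)
```

In the app, an instance could be kept in `st.session_state` (for example under a key such as `'memory'`, created only if that key is not yet present) so that it survives Streamlit's script reruns. The text input and the recognized speech could then be recorded with calls like `memory.add('user', 'speech', text)`.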