[AI Engineering & Web Development] Conversational AI service

상백 이
May 25, 2022
2 min read

Last updated: June 3, 2022

Figure 1. SSIFI logo

Project Summary

Companies across many markets are adopting conversational AI services. Markets and Markets reported a CAGR of 21.8% for the conversational AI market and predicted that the market would reach USD 18.4 billion in 2026.

The AI architecture of SSIFI consists of three stages: Speech-to-Text (STT), Natural Language Processing (NLP), and Text-to-Speech (TTS).

SSIFI provides a conversational AI service. In addition, we have released SSIFI's AI technology on GitHub for users who want to build a personalized SSIFI.

Figure 2. Conversational AI market (Markets and Markets)

Figure 3. SSIFI AI architecture
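
The three-stage architecture can be read as a simple chain: audio in, text, reply text, audio out. The following is a minimal sketch of that flow, not SSIFI's actual code. The class name `ConversationalPipeline` and the `fake_*` functions are hypothetical stand-ins so the example runs without any trained model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConversationalPipeline:
    """Chains the three stages: audio -> text -> reply text -> audio."""
    stt: Callable[[bytes], str]   # Speech-to-Text stage
    nlp: Callable[[str], str]     # text generation stage
    tts: Callable[[str], bytes]   # Text-to-Speech stage

    def respond(self, audio_in: bytes) -> bytes:
        user_text = self.stt(audio_in)
        reply_text = self.nlp(user_text)
        return self.tts(reply_text)


# Hypothetical stand-ins (assumptions): they only mimic the input/output types.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")      # pretend the audio bytes are already text


def fake_nlp(text: str) -> str:
    return f"You said: {text}"        # pretend this is a generated reply


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")       # pretend these bytes are a waveform


pipeline = ConversationalPipeline(fake_stt, fake_nlp, fake_tts)
reply_audio = pipeline.respond("hello".encode("utf-8"))
print(reply_audio.decode("utf-8"))
```

Keeping each stage behind a plain callable means any one model can be swapped out without touching the other two.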

Service Name : SSIFI
Project Duration : 11 Apr. 2022 ~ 27 May 2022
Number of Team members : 6
Role : Part Leader of AI and Engineer
Skills

AI

1. Speech-to-Text (STT)

STT is an AI model that converts human speech into text. As shown in Figure 4, the process is divided into three steps: pre-processing, an acoustic model, and a language model.

Figure 4. Speech-to-Text model process
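
To make the pre-processing step concrete, here is a minimal sketch of what commonly happens before the acoustic model: pre-emphasis, splitting the waveform into overlapping frames, and computing one feature per frame. The post does not say which features SSIFI uses, so the 16 kHz sample rate, the 25 ms / 10 ms framing, and the log-energy feature are all assumptions chosen for illustration.

```python
import math
from typing import List


def pre_emphasis(signal: List[float], coeff: float = 0.97) -> List[float]:
    """Boost high frequencies: y[t] = x[t] - coeff * x[t-1]."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[t] - coeff * signal[t - 1] for t in range(1, len(signal))
    ]


def frame_signal(signal: List[float], frame_len: int, hop_len: int) -> List[List[float]]:
    """Split the signal into overlapping frames; a trailing partial frame is dropped."""
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop_len)
    ]


def log_energy(frame: List[float]) -> float:
    """One scalar per frame; real systems usually use mel filterbank features instead."""
    return math.log(sum(x * x for x in frame) + 1e-10)


# Toy input (assumption): one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]

# 400 samples = 25 ms window, 160 samples = 10 ms hop at 16 kHz.
frames = frame_signal(pre_emphasis(signal), frame_len=400, hop_len=160)
features = [log_energy(f) for f in frames]
print(len(frames), len(frames[0]))
```

The resulting sequence of per-frame features is what an acoustic model would consume; the language model then helps turn its outputs into plausible text.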

2. Natural Language Processing (NLP)

SSIFI uses two models in this stage. The first is a Generative Pre-trained Transformer (GPT) for text generation, and the second is a Text-to-Image model (GLIDE).

2-1. Generative Pre-trained Transformer (GPT)

The GPT model is built by stacking the decoder blocks of the Transformer model. As a result, it performs well at predicting the words that follow a prompt (shown in Figure 5).
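
Predicting the words after a prompt is done one token at a time: the model scores candidates for the next token, one is appended, and the loop repeats. The sketch below illustrates that loop with greedy decoding. The `BIGRAM_SCORES` table is a hypothetical toy that looks only at the last token, whereas a real GPT conditions on the whole prefix through its stacked decoder blocks.

```python
from typing import Dict, List

# Hypothetical toy "model" (assumption): last token -> scores for the next token.
BIGRAM_SCORES: Dict[str, Dict[str, float]] = {
    "<bos>": {"the": 0.9, "a": 0.1},
    "the": {"robot": 0.6, "cat": 0.4},
    "robot": {"speaks": 0.7, "sleeps": 0.3},
    "speaks": {"<eos>": 1.0},
}


def next_token_scores(prefix: List[str]) -> Dict[str, float]:
    """Return next-token scores; unknown contexts end the sequence."""
    return BIGRAM_SCORES.get(prefix[-1], {"<eos>": 1.0})


def generate(prompt: List[str], max_new_tokens: int = 10) -> List[str]:
    """Greedy autoregressive decoding: append the best next token, then repeat."""
    tokens = list(prompt)  # the prompt must contain at least one token
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens


print(generate(["<bos>", "the"]))
```

Greedy decoding always picks the top-scoring token; generation services often sample from the scores instead to get more varied text.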

SSIFI provides five Korean text generation models based on GPT. These include a chat bot, a reporter bot, and a novel bot.

Figure 5. Generative Pre-trained Transformer model sample

2-2. Text-to-Image model (GLIDE)

In addition to the text generation models, SSIFI provides GLIDE, an image generation model. GLIDE was published by OpenAI in 2021 and was trained on a dataset of images paired with text captions. SSIFI receives a Korean prompt and outputs a matching image.

Figure 6 shows an example prompt and output of GLIDE.

Figure 6. Example outputs of GLIDE model

3. Text-to-Speech (TTS)

SSIFI provides audio output as well as text by using a TTS model. As shown in Figure 7, TTS consists of an acoustic model and a vocoder, which are FastSpeech 2 and VocGAN, respectively.
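
The split between acoustic model and vocoder can be sketched by looking at the shapes that pass between them: text units become mel-spectrogram frames, and each frame is then expanded into a fixed number of audio samples. The code below is a toy illustration of that hand-off only, not FastSpeech 2 or VocGAN. `N_MELS`, `HOP_LENGTH`, `SAMPLE_RATE`, and the fixed `frames_per_phoneme` are assumptions; FastSpeech 2 predicts a duration per phoneme rather than using a constant.

```python
import math
from typing import List

N_MELS = 80          # assumed number of mel bins per frame
HOP_LENGTH = 256     # assumed number of audio samples per mel frame
SAMPLE_RATE = 22050  # assumed output sample rate


def acoustic_model(phoneme_ids: List[int], frames_per_phoneme: int = 5) -> List[List[float]]:
    """Stand-in for the acoustic model: text units -> mel-spectrogram frames."""
    mel: List[List[float]] = []
    for pid in phoneme_ids:
        # Toy expansion: each phoneme becomes a fixed number of identical frames.
        mel.extend([[float(pid)] * N_MELS for _ in range(frames_per_phoneme)])
    return mel


def vocoder(mel: List[List[float]]) -> List[float]:
    """Stand-in for the vocoder: each mel frame -> HOP_LENGTH waveform samples."""
    wave: List[float] = []
    for frame in mel:
        freq = 100.0 + 20.0 * frame[0]  # toy mapping from frame content to pitch
        start = len(wave)
        wave.extend(
            math.sin(2 * math.pi * freq * (start + n) / SAMPLE_RATE)
            for n in range(HOP_LENGTH)
        )
    return wave


mel = acoustic_model([1, 2, 3])
wave = vocoder(mel)
print(len(mel), len(mel[0]), len(wave))
```

Because the two stages only share the mel-spectrogram format, the acoustic model and the vocoder can be trained and replaced independently.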

The TTS model of SSIFI was trained on the Korean Single Speaker Speech (KSS) dataset from Kaggle (4.32 GB, 12 hours of audio).