Multi-modal learning

CS Study/AI 2023. 4. 6. 04:42

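Speech, images, and text each represent data in a different way.
When using several modalities, is it always best to use them all equally (fairly)?
- No! During training the model can become biased toward the easier modality.
- The harder modality then ends up not being learned.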

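Types of multi-modal learning
- Matching: map data of different forms into one shared space and match them there
- Translating: translate data of one form into data of another form
- Referencing: refer to data of another form while solving a task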

Multi-modal task: Vision & Text data

Text embedding

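Raw characters are hard to use directly in machine learning, so words must be mapped to vectors (word mapping).
Relationships between pairs of words generalize in the embedding space.
- For example, if man and woman differ by an offset k, then moving king by the same offset k lands near queen. Pretty neat!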
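
The offset idea above can be sketched with hand-picked toy vectors. This is only an illustration: the 2-D embeddings and the helper `nearest` are made up for the example and are not learned word2vec vectors.

```python
import numpy as np

# Toy 2-D embeddings chosen by hand for illustration (NOT learned vectors).
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
embeddings = {
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def nearest(vec, exclude):
    """Return the word whose embedding is closest (Euclidean) to vec."""
    candidates = {w: e for w, e in embeddings.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - vec))

# Offset k between "man" and "woman", then applied to "king".
k = embeddings["woman"] - embeddings["man"]
result = nearest(embeddings["king"] + k, exclude={"man", "woman", "king"})
print(result)  # queen
```

With real embeddings the result is only approximately on top of queen, which is why a nearest-neighbor lookup is used instead of an exact match.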

word2vec Skip-gram model

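A model that predicts the surrounding words from a center word, using a window of N words.
- With N=5 and the sentence 'The quick brown fox jumps over the lazy dog', the first 5 words are centered on brown and yield 4 word pairs: (brown, the) ... (brown, jumps)
The center word (the middle of the window x_k) is turned into a one-hot vector and multiplied by W (size V×n) to get the embedded vector (hidden layer).
The embedded vector is multiplied by W' to compute a score vector, and then softmax is applied (output layer).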
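
A minimal sketch of the pair generation and the forward pass described above, assuming randomly initialized weights and an arbitrary embedding size. The names `skipgram_pairs`, `forward`, and `embed_dim` are illustrative, and no training step is shown.

```python
import numpy as np

sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(sentence))
word_to_id = {w: i for i, w in enumerate(vocab)}
V, embed_dim = len(vocab), 4  # vocabulary size; embedding size is arbitrary

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs; window=2 on each side spans 5 words."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = skipgram_pairs(sentence)
brown_pairs = [p for p in pairs if p[0] == "brown"]

rng = np.random.default_rng(0)
W = rng.normal(size=(V, embed_dim))        # input weights (V x n)
W_prime = rng.normal(size=(embed_dim, V))  # output weights (n x V)

def forward(center):
    """One-hot -> embedded vector -> scores -> softmax over the vocabulary."""
    x = np.zeros(V)
    x[word_to_id[center]] = 1.0
    h = x @ W                 # hidden layer: selects one row of W
    scores = h @ W_prime      # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

probs = forward("brown")
```

Because the input is one-hot, `x @ W` is just a row lookup, which is why the rows of W end up being used as the word embeddings.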

Joint embedding (Matching)

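Image tagging

Generate tags from an image, or retrieve images from tags.
Combine pre-trained unimodal models.
Metric learning: map text and images into the same embedding space and learn whether to pull each pair of embeddings closer together or push them apart.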
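
One common way to express "pull matching pairs closer, push others apart" is a triplet margin loss. The notes do not say which loss is used, so the sketch below is an assumption, with made-up 2-D embeddings standing in for the outputs of the image and text models.

```python
import numpy as np

def l2_normalize(x):
    """Scale a vector to unit length so distances are comparable."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer than the negative by at least margin."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.maximum(0.0, d_pos - d_neg + margin))

# Hypothetical embeddings already mapped into the shared space.
image_emb = l2_normalize(np.array([1.0, 0.1]))
matching_tag = l2_normalize(np.array([0.9, 0.2]))
unrelated_tag = l2_normalize(np.array([-0.2, 1.0]))

loss_good = triplet_loss(image_emb, matching_tag, unrelated_tag)
loss_bad = triplet_loss(image_emb, unrelated_tag, matching_tag)
```

Here `loss_good` is zero because the matching tag is already much closer to the image than the unrelated tag, while `loss_bad` is positive and would produce a gradient.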

Image & food recipe retrieval

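Recommend a recipe from an image, or an image from a recipe.
The ingredients and instruction text are fed into an RNN to produce one fixed-length vector; the image is likewise turned into one feature vector.
- loss1: a cosine similarity loss that drives the similarity high for related image-text pairs and low for unrelated ones
- loss2: a semantic regularization loss that handles what loss1 cannot (high-level semantics)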
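
A rough sketch of loss1, assuming the usual cosine-embedding form (1 - cos for matching pairs, a margin hinge for non-matching pairs). The margin value and the function names are assumptions, and the semantic regularization term (loss2) is not shown.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_pair_loss(image_vec, recipe_vec, is_match, margin=0.1):
    """Matching pairs: raise similarity. Non-matching pairs: push below margin."""
    cos = cosine_similarity(image_vec, recipe_vec)
    if is_match:
        return 1.0 - cos
    return max(0.0, cos - margin)

# Placeholder vectors standing in for the image and recipe encoders' outputs.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

loss_match_same = cosine_pair_loss(a, a, is_match=True)       # already aligned
loss_nonmatch_orth = cosine_pair_loss(a, b, is_match=False)   # already apart
loss_nonmatch_same = cosine_pair_loss(a, a, is_match=False)   # penalized
```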

Cross modal translation (Translating)

Image-to-sentence (Show and Tell model)

CNN for image & RNN for sentence
The encoder is a CNN pre-trained on ImageNet.
The decoder is an LSTM module (it repeats until the end token is produced).

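Show, Attend and Tell model

Improves on Show and Tell by adding attention (weights are applied to the image locations that the current word refers to).
The input image passes through a CNN to give a 14x14 feature map; this feature map is passed through an RNN, which produces a heatmap of where to look. The weighted sum of the feature map under the heatmap gives the output z.

The encoded image and the previous hidden state are used by the attention module to generate a weight for each pixel (each feature-map location). The previously generated word and the weighted sum of the encoding go into the LSTM decoder, which generates the next word.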
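
The heatmap and weighted sum can be sketched as soft attention over the 14x14 locations. The additive scoring function, the channel and hidden sizes, and the random parameters `W_f`, `W_h`, `v` below are assumptions for illustration, not the exact parameterization of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, channels, hidden_dim, attn_dim = 14, 14, 8, 16, 32

feature_map = rng.normal(size=(height, width, channels))  # CNN output (toy)
h_prev = rng.normal(size=hidden_dim)                      # previous LSTM state

# Hypothetical attention parameters (randomly initialized, untrained).
W_f = rng.normal(size=(channels, attn_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, attn_dim)) * 0.1
v = rng.normal(size=attn_dim) * 0.1

def soft_attention(feature_map, h_prev):
    """Return a (14, 14) heatmap and the context vector z."""
    locations = feature_map.reshape(-1, feature_map.shape[-1])  # (196, C)
    scores = np.tanh(locations @ W_f + h_prev @ W_h) @ v        # (196,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                    # softmax
    z = weights @ locations                                     # weighted sum
    return weights.reshape(feature_map.shape[:2]), z

heatmap, z = soft_attention(feature_map, h_prev)
```

The heatmap sums to 1 over all 196 locations, so z is a convex combination of the location features and changes at every decoding step as `h_prev` changes.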

Text-to-image by generative model

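Generate an image given a sentence.
The input sentence goes through a generator network, which produces an image.
The generated image and the input sentence are fed together to a discriminator, which judges whether the image is true or false conditioned on the sentence.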

Cross modal reasoning (Referencing)

Visual question answering

Given an image and a question, derive the answer.
Attention is used to reference the image while deciding the next step.

There is an image stream and a question stream; each outputs a fixed-dimensional vector.
The two vectors are combined by joint embedding.
End-to-end training.
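
A minimal sketch of the two-stream fusion, assuming the joint embedding is an element-wise product followed by a linear answer classifier. The notes do not specify the fusion, so this is one common baseline choice; the vectors and `W_answer` are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_answers = 32, 5

# Placeholders for the outputs of the image stream and the question stream.
image_vec = rng.normal(size=dim)
question_vec = rng.normal(size=dim)
W_answer = rng.normal(size=(dim, n_answers)) * 0.1  # hypothetical classifier

def answer_distribution(image_vec, question_vec):
    """Fuse the two fixed-size vectors and score each candidate answer."""
    joint = np.tanh(image_vec) * np.tanh(question_vec)  # element-wise fusion
    logits = joint @ W_answer
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

answer_probs = answer_distribution(image_vec, question_vec)
```

In end-to-end training, the gradient of the answer loss would flow back through `joint` into both streams.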

Multi-modal task: Vision & Audio data

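Sound representation

The 1D signal must be converted into an acoustic feature in spectrogram form.

Short-time Fourier transform (STFT): apply the FT only within short segments
- Slide over the signal with overlapping windows at a suitable interval and convert each one to a spectrum (a plot of magnitude per frequency)
- The FT decomposes the input signal into its frequency components
Spectrogram: shows how the magnitude of each frequency changes over time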
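
The STFT steps above can be sketched in NumPy. The frame length, hop size, Hann window, and the 1 kHz test tone are choices made for this example only.

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude spectrogram with shape (freq_bins, n_frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Overlapping frames: each starts `hop` samples after the previous one.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # FT of each short frame, keeping only the magnitude.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate, frame_len = 8000, 256
t = np.arange(sample_rate) / sample_rate        # 1 second of samples
signal = np.sin(2 * np.pi * 1000 * t)           # pure 1 kHz tone

spec = stft_magnitude(signal, frame_len=frame_len)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / frame_len    # bin index -> frequency in Hz
```

For this tone the energy concentrates in a single frequency row across all frames, which is what a horizontal line in a spectrogram plot corresponds to.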

Joint embedding (Matching)

SoundNet

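Each frame of the video is fed into visual recognition networks.
- The ImageNet CNN outputs an object distribution describing which objects are present
- The Places CNN outputs a scene distribution describing which scene it is
The audio is extracted as a raw waveform and fed into a 1D CNN.
The last layer splits into two heads.
The two heads are trained with KL divergence so that they can perform object recognition and scene recognition.
- They learn to imitate the object distribution and the scene distribution, respectively
- Training moves in the direction that minimizes the KL divergence
The image networks are kept fixed and only the sound network is trained (teacher-student scheme, transfer learning).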
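
A small sketch of the two-head KL objective, assuming toy teacher distributions with 3 object classes and 2 scene classes. The name `two_head_kl_loss` and all numbers are illustrative; the real networks and class counts are not modeled here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Fixed teacher outputs from the visual networks (toy values).
teacher_object = softmax(np.array([2.0, 0.5, -1.0]))
teacher_scene = softmax(np.array([0.1, 1.5]))

def two_head_kl_loss(object_logits, scene_logits):
    """Sum of the KL terms for the object head and the scene head."""
    return (kl_divergence(teacher_object, softmax(object_logits))
            + kl_divergence(teacher_scene, softmax(scene_logits)))

# Student (sound network) heads: uniform guess vs. perfect imitation.
loss_untrained = two_head_kl_loss(np.zeros(3), np.zeros(2))
loss_matched = two_head_kl_loss(np.array([2.0, 0.5, -1.0]), np.array([0.1, 1.5]))
```

Only the student logits would receive gradients; the teacher distributions are constants, which matches keeping the image networks fixed.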

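Visual knowledge is transferred from the pre-trained visual recognition models to the audio modality.
For a target task, a classifier is trained on the pre-trained pool5 features.
- This is because pool5 holds more generalizable semantic information than the output layer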

Cross modal translation (Translating)

Speech2Face

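A network that imagines a person's face from their voice.
Face images and voices are treated as paired data, so this is a self-supervised approach that needs no annotation.
Uses the VGG-Face model.
A face decoder is trained in advance to reconstruct a frontal face.
The voice feature is trained to imitate the face feature.

Once the face feature and the voice feature are trained to be compatible, the voice feature can be fed straight into the face decoder.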

Image-to-speech synthesis

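A network that outputs speech given an image.
The front part has the same structure as the Show, Attend and Tell model. The difference is that it outputs sub-word units (tokens) rather than words.
The back part is a Tacotron2 structure that converts the sub-word units into speech.
Learned units make the front part and the back part compatible
- Units are extracted from speech
- They are split in half and used for the image-to-unit model and the unit-to-speech model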

Cross modal reasoning (Referencing)

Sound source localization

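The task of finding where in the image a sound is coming from.

The image and audio inputs pass through a CNN-based visual network and audio network to obtain fixed-dimensional vectors.
The spatial feature from the visual network and the audio feature are passed to an attention network.
Taking the inner product of the two features gives the localization score.
Supervised version: the localization score is trained using several losses.
Unsupervised version: the localization score and the spatial feature are multiplied element-wise to extract the attended visual feature.
- The attended visual feature is compared with the input audio to judge whether they came from the same image (true, false)
Semi-supervised version: uses both the supervised loss and the unsupervised loss.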
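
The inner-product score and the attended visual feature can be sketched as follows. The 7x7 grid, the L2 normalization, and the softmax used to turn scores into weights are assumptions for this toy example; the audio feature is copied from one grid location so the expected peak is known.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, dim = 7, 7, 16

# Toy spatial feature from the visual network.
spatial_feature = rng.normal(size=(height, width, dim))
# Toy audio feature: copied from location (2, 5), the "sounding" location.
audio_feature = spatial_feature[2, 5].copy()

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Inner product between every spatial location and the audio feature.
score = np.einsum("hwd,d->hw",
                  l2_normalize(spatial_feature),
                  l2_normalize(audio_feature))

# Attended visual feature: score-weighted sum over locations.
weights = np.exp(score) / np.exp(score).sum()
attended = np.einsum("hw,hwd->d", weights, spatial_feature)

peak_location = np.unravel_index(score.argmax(), score.shape)
```

In the unsupervised setting, `attended` is what gets compared against the audio feature to decide whether the pair matches.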

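Looking to Listen at the Cocktail Party (paper)

Separates each speaker in a video where several people talk at the same time.

Training data: generated synthetically by combining two videos with clean speech
Loss: L2 loss between the 'clean spectrogram' and the 'enhanced spectrogram'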
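
A toy sketch of the synthetic mixing and the L2 loss, using two sine tones as stand-ins for the clean speech tracks. The mean-squared form of the loss, the non-overlapping framing, and all names are assumptions, and no separation model is included.

```python
import numpy as np

sample_rate, frame_len = 8192, 256
t = np.arange(sample_rate) / sample_rate

def magnitude_spectrogram(wave):
    """Non-overlapping frames -> magnitude spectrum per frame (simplified)."""
    frames = wave.reshape(-1, frame_len)
    return np.abs(np.fft.rfft(frames, axis=1))

def l2_loss(clean_spec, enhanced_spec):
    """Mean squared difference between the two spectrograms."""
    return float(np.mean((clean_spec - enhanced_spec) ** 2))

# Two "clean" sources (sine tones standing in for speech), then mixed.
speaker_a = np.sin(2 * np.pi * 440 * t)
speaker_b = 0.5 * np.sin(2 * np.pi * 1000 * t)
mixture = speaker_a + speaker_b

clean_spec = magnitude_spectrogram(speaker_a)
loss_perfect = l2_loss(clean_spec, magnitude_spectrogram(speaker_a))
loss_unseparated = l2_loss(clean_spec, magnitude_spectrogram(mixture))
```

Because the clean tracks are known before mixing, the target spectrogram is available for free, which is the point of generating the training data synthetically.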

+ To study further

+ Tesla self-driving Autopilot

+ Lip movement generation - Synthesizing Obama: generates a person's mouth movements from speech audio
