Training Data

Introduction

Because this model uses advanced phoneme-level speech synthesis, the audio files must be preprocessed before training. These preprocessing steps are part of standard music production. If you are unsure about any of them, please consult your music producer or audio engineer, or contact us to confirm the details.

Audio Format

We only accept the following formats:

Format: WAV
Bit depth: 16-bit
Sample rate: 44.1 kHz
Channels: mono

File content: Pure single-track solo vocal recordings without any spatial effects, other instruments, unison, or dubbing.
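
The format requirements above can be checked automatically before submission. The following is a minimal sketch that uses only Python's standard `wave` module; the function name `check_wav_format` is hypothetical and not part of any tool we provide. It checks the container parameters only and cannot detect reverb, extra instruments, or layered vocals.

```python
import wave

# Expected values taken from the format list above.
EXPECTED_CHANNELS = 1          # mono
EXPECTED_SAMPLE_WIDTH = 2      # 16 bits = 2 bytes per sample
EXPECTED_SAMPLE_RATE = 44100   # 44.1 kHz


def check_wav_format(path):
    """Return a list of problems found; an empty list means the file passes."""
    try:
        with wave.open(str(path), "rb") as wav:
            channels = wav.getnchannels()
            sample_width = wav.getsampwidth()
            sample_rate = wav.getframerate()
    except (wave.Error, EOFError) as exc:
        # Raised for non-WAV data, unsupported encodings, or truncated files.
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"expected 16-bit, found {sample_width * 8}-bit")
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"expected 44100 Hz, found {sample_rate} Hz")
    return problems
```

Calling `check_wav_format("sentence1.wav")` on each file and printing any non-empty result is usually enough to catch stereo exports or 48 kHz sessions before upload.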

Audio Preprocess

Dynamic Preprocess

Please apply dynamics processing and limiting to keep the singer's level within an appropriate range. Suitable processors include, but are not limited to:

Compressor
Expander
Gate
Parallel compression
De-Esser
EQ
Limiter
Multi-Band Compressor
De-Click
De-Noise

Pitch Correction (Optional)

Pitch correction is not mandatory. If your file has already been pitch-corrected, please provide the corrected vocal file; if it has not, the model's quality will not be affected. Common pitch correction tools include, but are not limited to:

Melodyne
Revoice
Auto-Tune

[DO NOT] Spatial Effects

The use of any spatial effects is strictly prohibited, including but not limited to:

Delay
Reverb

Dubbing/Backing Vocals/Unison

If the singer has recorded multiple parts as backing or unison vocals, please provide them as separate files. We accept multiple WAV files of the same song, whether they contain the same part or different parts. Do not combine them into a single WAV file: doing so is equivalent to applying spatial effects and will severely compromise the model's performance.

File organization (please follow strictly):

Please place each song in a separate folder. Each folder should contain the audio files of the song, with the audio already segmented into individual sentences.
Inside each song folder, include a plain text file named "list.txt". This text file should list the file names and their corresponding lyrics.
Note: Song folder names and sentence file names can be customized, but they may contain only English letters, numbers, dashes, and underscores.

training dataset 
│
└───song1
│   │   sentence1.wav
│   │   sentence2.wav
│   │   sentence3.wav
│   │   ...
│   │   list.txt
│   
└───song2
│   │   sentence1.wav
│   │   sentence2.wav
│   │   sentence3.wav
│   │   ...
│   │   list.txt
│   
└───song3
│   │   sentence1.wav
│   │   sentence2.wav
│   │   sentence3.wav
│   │   ...
│   │   list.txt
│
...
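
The naming rule for song folders and sentence files can be verified with a short script. This is a sketch under the assumption that the dataset follows the layout shown above (one level of song folders containing `.wav` files); `is_valid_name` and `find_invalid_names` are hypothetical helper names.

```python
import re
from pathlib import Path

# Allowed characters: English letters, digits, dashes, and underscores.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_name(name):
    """True if the whole name consists only of allowed characters."""
    return NAME_PATTERN.fullmatch(name) is not None


def find_invalid_names(dataset_root):
    """Return paths of song folders and .wav files whose names break the rule."""
    bad = []
    for song_dir in sorted(Path(dataset_root).iterdir()):
        if not song_dir.is_dir():
            continue
        if not is_valid_name(song_dir.name):
            bad.append(str(song_dir))
        for wav_path in sorted(song_dir.glob("*.wav")):
            # Check the name without its ".wav" extension.
            if not is_valid_name(wav_path.stem):
                bad.append(str(wav_path))
    return bad
```

The pattern spells out `0-9` instead of `\d` because `\d` in Python also matches non-ASCII digits, which the rule does not allow.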

Each line of the list.txt file should have the form: sentence_name.wav | lyrics. Note that the sentence name includes the file extension, and that the name and the lyrics are separated by a | symbol (a space or a tab alone is not a valid separator). Each line represents one sentence. Please label every sentence, one per line, until the entire song is covered.

#list.txt (example)
sentence1.wav | Twinkle twinkle little star
sentence2.wav | How I wonder what you are
sentence3.wav | Up above the world so high
sentence4.wav | Like a diamond in the sky
sentence5.wav | Twinkle twinkle little star
sentence6.wav | How I wonder what you are
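
To confirm that a song folder and its list.txt agree, something like the following sketch may help. It assumes list.txt is saved as UTF-8 (the encoding is not specified in this document) and that blank lines can be ignored; `parse_list_file` and `check_song_folder` are hypothetical helper names.

```python
from pathlib import Path


def parse_list_file(list_path):
    """Parse list.txt into (file_name, lyrics) pairs; raise ValueError on bad lines."""
    entries = []
    # Assumption: the file is UTF-8 encoded.
    text = Path(list_path).read_text(encoding="utf-8")
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are harmless and skipped
        # Split on the first "|" only, so lyrics may themselves contain "|".
        name, separator, lyrics = line.partition("|")
        name, lyrics = name.strip(), lyrics.strip()
        if not separator or not name.endswith(".wav") or not lyrics:
            raise ValueError(
                f"{list_path}:{line_number}: expected 'sentence_name.wav | lyrics'"
            )
        entries.append((name, lyrics))
    return entries


def check_song_folder(song_dir):
    """Report .wav files that are listed but missing, or present but unlisted."""
    song_dir = Path(song_dir)
    listed = [name for name, _ in parse_list_file(song_dir / "list.txt")]
    on_disk = {path.name for path in song_dir.glob("*.wav")}
    problems = []
    for name in listed:
        if name not in on_disk:
            problems.append(f"listed but missing: {name}")
    for name in sorted(on_disk - set(listed)):
        problems.append(f"present but not listed: {name}")
    return problems
```

Running `check_song_folder` on every song folder before submission catches the two most common mismatches: a sentence that was renamed after labeling, and a sentence that was never labeled.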

Recommended length of training data:

File length is not the key factor in model quality, since a very long WAV file may contain a lot of silence and little singing. Because the model works at the phoneme level, the distribution of phonemes matters more than total length. We therefore recommend:

Total singing duration > 1 hour
All phonemes should be covered, with at least 20 instances of each phoneme (50 or more is recommended).
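
For a rough self-check against these two recommendations, a sketch along the following lines may be useful. It rests on several assumptions: the summed file length includes any silence, so it is only an upper bound on actual singing duration; the word-to-phoneme `lexicon` and the phoneme `inventory` are hypothetical inputs you would need to supply yourself (for example from a pronunciation dictionary), since the exact phoneme set used by the model is not specified in this document; and words are split on whitespace without stripping punctuation. All function names are illustrative.

```python
import wave
from collections import Counter
from pathlib import Path


def total_duration_seconds(dataset_root):
    """Sum the length of every .wav file one level below dataset_root."""
    total = 0.0
    for wav_path in Path(dataset_root).glob("*/*.wav"):
        with wave.open(str(wav_path), "rb") as wav:
            total += wav.getnframes() / wav.getframerate()
    return total


def count_phonemes(lyrics_lines, lexicon):
    """Count phonemes over all lyrics; also return words missing from the lexicon."""
    counts = Counter()
    unknown_words = set()
    for line in lyrics_lines:
        for word in line.lower().split():
            phonemes = lexicon.get(word)
            if phonemes is None:
                unknown_words.add(word)
            else:
                counts.update(phonemes)
    return counts, unknown_words


def underrepresented(counts, inventory, minimum=20):
    """List phonemes from the inventory that occur fewer than `minimum` times."""
    return sorted(p for p in inventory if counts[p] < minimum)


# Usage with a toy, hypothetical lexicon (not the model's real phoneme set):
toy_lexicon = {
    "twinkle": ["T", "W", "IH", "NG", "K", "AH", "L"],
    "star": ["S", "T", "AA", "R"],
}
counts, unknown = count_phonemes(["Twinkle twinkle little star"], toy_lexicon)
print(counts["T"], unknown)
```

Words reported as unknown are not counted at all, so a large unknown set means the phoneme counts understate the real coverage.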

If you do not know how to calculate phonemes and duration, please submit the files to our engineers. We will calculate them for you, and based on the results, you can decide whether to add more training data or proceed directly with training.