Glossary

グロスラリ

An all-in-one glossary of VocalSynth terms and vernacular for newcomers

and visitors with a passing interest.

Note: Entries are subject to contain some amount of editorialising for the purpose of summarising a general consensus or community atmosphere surrounding a specific concept, these are all subject to change over time.

You can select any heading from the list below to jump to that section.

General Vocalsynth Terms:

- Vocal Synthesizer

- Voicebank

- "Standard" Voicebank

- AI Voicebank

- Append / Expression

- Cross-Synthesis (XSY)

- Cross-Lingual Synthesis (XLS)

Examples of Vocal Synthesizers:

- VOCALOID

- UTAU

- OpenUTAU

- Synthesizer V

- Piapro Studio / NT

- CeVIO

- VOCALO CHANGER

- Vocoflex

Relevant Linguistic Terms:

- Phoneme

- Dictionary

- Vowel

- Consonant

- Diphone

- Triphone

- Diphthong

- Romaji

- IPA

- X-SAMPA

- VOCALOID SAMPA

- ARPABET

- CZ Phonemes

Production Terms & File Extenstions:

- Tuning

- Mixing

- Translyrics

- Plug & Play

- Vocal Morphing

- MIDI

- VOCALOID MIDI, VSQ, VSQX & VPR

- UST & USTX

- JSON, S5P & SVP

UTAU-Specific Terminology:

- Flag

- Sampler

- Voicebank Type / Method

- Reclist

- Oto, Alias, Offset, Cutoff, Preutterance, Overlap & Consonant

- Pitch

- Labelling & Training

- CV, VCV, CVVC, Arpasing & VCCV

General VocalSynth Terms

Vocal Synthesizer (Abbr. VocalSynth, VSynth) -

A type of music software that creates virtual singing by entering notes and attaching lyrics to each note.

This is also used as a global term to refer to the communities that create works involving vocalsynths.

Voicebank -

The singers themselves, voicebanks contain the phonetic data or singing data of a person's voice in order to be used in a vocal synthesizer.

Voicebanks are also referred to as "Models", "Libraries", "Sound Sources" or "Databases", which are more formal terms that describe voicebanks to those outside of the community.

In essence, a voicebank is akin to combining recordings of a real person's voice to create a new vocal phrase.

These voicebanks are frequently given fictional avatars and identities to separate the voicebank from the original voice provider.

eg. "Hatsune Miku" is a fictional character using voice data sampled from Japanese voice actress Saki Fujita.

"Standard" Voicebank -

"Standard" is a simplified term to describe voicebanks created using the

"Concatenative Synthesis" method.

This means the voicebank creates audio by directly sourcing from the voice recordings, these recordings are often of the voice provider pronouncing a string of various sounds and phonemes in a flat, monotone manner to provide a clean, baseline voice.

The general consensus of Standard Voicebanks is they produce a clean audio quality, can be created regardless of a voice provider's singing ability and allows figures like voice actors to easily personalise the voice to their liking, but as the audio data is not of realistic singing, the vocals can often sound more artificial and require more editing in the software to sound expressive - Additionally, the means of producing a Standard Voicebank is larger in scale and has more of a learning curve for both the production team and the voice provider, so while it's the flagship method of vocal synthesis, it has since the 2020s dropped in utilisation in the commercial industry.

AI Voicebank -

Voicebanks created using AI Synthesis. Rather than directly sourcing phonetic recordings like in Standard Banks, AI Banks involve training an AI to completely reconstruct a person's singing voice using recordings of them singing. AI banks themselves do not actually contain the recordings in full and instead source from an AI Model that replicated the voice to capture that voice provider's typical delivery and pronunciation when singing.

AI Banks typically come in two forms, there are "Manual" variants that behave like Standard Banks, requiring a user to manually enter the notes, lyrics and voice dynamics and "Automatic" variants that act as a voice changing filter, taking the voice of a source audio and replacing it with the voice of the AI model.

The general consensus of AI Voicebanks is they provide more realistic and expressive vocals and the means of creating an AI bank is more streamlined, especially for voice providers who just have to provide singing data, but sourcing from an AI model rather than directly from the audio leads to a loss in clarity and clearness from bank to bank.

There are also some ethical concerns regarding AI banks, especially Automatic AI banks as theoretically any vocal recording can be used as training data and the lack of conventional workflow has lead to it being compared to the likes of AI Art Generation.

Manual AI Banks however are generally held in a high regard as they require the same manual input from the user as concatenative voicebanks to produce satisfying work, and most commercial distributors of Manual AI Banks work with voice providers who gave their consent and were rightfully compensated.

Append / Expression -

Additional voicebank sets of the voice provider in a different tone of voice, this allows a vocalist to be utilised in a variety of tones and genres, or to add a dynamic, emotional vibe to a vocal track.

For example, Hatsune Miku's VOCALOID4 voicebank, "Hatsune Miku V4x" bundles with ORIGINAL, SOFT, SOLID, DARK & SWEET.

In Standard banks, this is accomplished by swapping to entirely new sample sets, while in AI banks, this is done by training the AI to capture other expressions heard in the training data.

Cross-Synthesis (XSY) -

An Expression feature previously exclusive to VOCALOID4 that allows users to blend two expressions of one vocalist together and configure the easing between the two for a more dynamic vocal that changes over the course of a phrase. This feature was removed in VOCALOID5 but has since returned to VOCALOID6 via. a post-launch software update.

Cross-Lingual Synthesis (XLS) -

A language feature introduced in the Pro edition of Synthesizer V Studio, XLS allows users with purchased AI banks to output vocals in any of SynthV's supported languages regardless of the original language the training data was recorded in (As of writing, supported languages are English, Japanese, Mandarin, Cantonese & Spanish).

A similar feature was introduced in VOCALOID6, with VOCALOID AI banks being capable of singing in English, Japanese and Mandarin Chinese.

Examples of Vocal Synthesizers

VOCALOID -

A commercial vocalsynth created by YAMAHA Corporation and originally released in 2004, the software has since received many upgrades over the years, including the VOCALOID 6 Editor (2022).

VOCALOID is seen by many as the originating point of many vocalsynth developers and vocalists and was the original software of many iconic vocalsynth mascots such as Hatsune Miku and GUMI, who originated on the VOCALOID 2 Editor in 2007 and 2009 respectively.

While in a modern context VOCALOID no longer has the significant community focus or modern tech advancements other vocalsynth competitors have, many versions of the software and its vocalists still see use to this day, to the point where "Vocaloid" is used by many as a universal term to describe vocalsynths and vocalsynth mascots, including those who have never featured on the software.

VOCALOID voicebanks were dominantly Standard/Concatenative up until the introduction of the VOCALOID 6 Editor which introduced VOCALOID AI voicebanks.

UTAU -

UTAU is an independently created shareware vsynth by programmer Ameya that released in 2008, its primary appeals are not only that it's entirely free-to-download but while other major vocalsynths are commercial software that only implement new vocalists and features under first and third party licensing, UTAU allows its community to create and freely distribute new vocalists, vocal types and plugins from scratch (including the voicebanks that are distributed on CafeSynth).

Thanks to these community efforts, UTAU has vastly grown in functionality, but its customisable nature and archaic interface does mean it has a slightly steeper initial learning curve for newcomers.

UTAU voicebanks are Standard/Concatenative, but UTAU itself can be used as an interface to use some AI vocalsynths that do not have interfaces in of themselves, such as NNSVS, allowing UTAU community members to create their own AI Banks.

OpenUTAU -

OpenUTAU is a third-party, open-source successor to UTAU made for the purpose of modernising UTAU's interface and functions, its main exclusive functions include the ability to make multiple vocal tracks in one project file and Phonemizers, scripts that automatically convert notes in a sequence so they are compatible with the voicebank used, negating the need for the user to learn the unique workflow of each bank type.

OpenUTAU sees frequent praise, support and community attention for kickstarting a new era of UTAU usage, but its differences and quirks mean there is still a lot of appeal and utility in using the classic editor, especially when using voicebanks and plugins that were not made with OpenUTAU in mind.

OpenUTAU uses Standard/Concatenative UTAU voicebanks, but can also act as an interface for AI banks from backends like NNSVS and DiffSinger, OpenUTAU is often regarded as the go-to editor for community-made AI banks.

Synthesizer V -

Synthesizer V was created by Dreamtonics, originally released in August 2018 as a technical preview (referred to by fans as "R1" or "Gen 1"), in December 2018 as an official release ("Synthesizer V") and has since been upgraded as "Synthesizer V Studio" in 2020 - on December 2024, Dreamtonics then announced a new successor, "Synthesizer V Studio 2".

SynthV is currently recognised as the most "cutting-edge" and easiest to use vsynth in the industry, receiving frequent significant updates and features not yet seen on other synths at the time, including the ability to make every AI vocalist available multilingual, regardless of the language data the bank originally contained.

The Studio editor has a Basic and a Pro edition, with the Basic edition being free to download and use for an unlimited time with some restrictions, such as a maximum of 3 tracks a project and blocking major features like the previously mentioned Cross Lingual Synthesis.

Synthesizer V started with a range of Standard/Concatenative banks, but since the introduction of Synthesizer V AI in 2020, the majority of the SynthV bank range are Manual AI.

Piapro Studio / NT -

Piapro is the franchise name that refers to all software and vocalists made by Crypton Future Media, the creators of Hatsune Miku, Kagamine Rin, Kagamine Len, Megurine Luka and the license holders of KAITO and MEIKO.

Piapro Studio refers to Crypton's line of vocalsynth editors, with "Piapro Studio V4x" being an alternate VOCALOID-based editor for VOCALOID voicebanks and "Piapro Studio NT" being their own proprietary vocalsynth built from the ground up for their NT line of voicebanks (eg. Hatsune Miku NT).

Piapro Studio V4x sees some utilisation as an effective alternate interface for using VOCALOID, but Piapro Studio NT received mixed reception on release as the engine vastly changed the tone of Crypton's vocalists, making them sound vastly different to their VOCALOID counterparts.

With time, Piapro Studio NT has received performance and stability updates, improving the clarity of its vocalists.

Hatsune Miku NT is currently the only vocalist for Piapro Studio NT, which is a concatenative bank that since its 2.0 update, utilises AI to improve its output.

CeVIO -

(Pronounced "cheh-viy-ow")

CeVIO is a predominantly Japanese line of vocalsynth software made by CeVIO Team to assist in user-generated content, the primary appeal of CeVIO being it's wide array of characters and providing voicebanks for text-to-speech as well as virtual singing.

CeVIO has a range of different editors for different voice types and purposes, modern examples being CeVIO AI and Voisona (Previously announced as "CeVIO PRO").

CeVIO is often considered to be a more niche competitor in the commercial vocalsynth industry, especially when compared to the likes of Synthesizer V Studio, but it still sees use, especially from Japanese audiences and serves as the origin point of many well known vocalsynth mascots like Kafu, Chis-A and SELENA.

VOCALO CHANGER -

VOCALO CHANGER is a utility in the VOCALOID 6 Editor that can be purchased as a separate plugin, it can be used with VOCALOID AI voicebanks and allows them to be used as Automatic, Voice Filter AI banks, overwriting a source audio with the singing voice of the AI model, this allows for immediate results, at the cost of not being modifiable and the output quality depending on the initial quality of the source audio.

Vocoflex -

Vocoflex is an alternate vocal synthesizer by Dreamtonics that utilises automatic, voice filter AI synthesis as opposed to the Manual workflow of Synthesizer V.

Vocoflex allows users to input audio and have it played back to them using a voice entirely generated from scratch using its parameters, or using a model trained from the user's own imported voice samples.

Users can also use it as an effect plugin, allowing them to use vocal phrases made in Synthesizer V as the input utterance, broadening the characteristics that can be accomplished with one voicebank.

Vocoflex is also known for its extensive safeguards to ensure it is used ethically, including an ID verification system before a user can install the software as well as a watermarking system that allows Dreamtonics to identify the creator of an unethical Vocoflex render.

Relevant Linguistic Terms

Phoneme -

Any and all units of sound in a language, such as "f", "ow" and "n".

Using vocalsynths in an advanced context involves utilising phonemes to specify to the editor how you want a word or syllable to be pronounced by the vocalist, different voicebanks and editors will use different phoneme encoding sets that serve different purposes.

Dictionary -

Dictionary systems are a feature in commercial vsynths that allow users to enter lyrics and have the software automatically allocate the phonemes for them.

This allows for a faster, easier workflow, but provides less specification than manual phoneme allocation.

Vowel -

Voiced units of sound that perpetually open the speaker's voice, such as "ah", "eh", "iy", "ow" and "uw".

Consonant -

Unvoiced units of sound where pronouncing it requires the speaker to close their voice, such as "p", "b" and "d".

Diphone -

A pair of phonemes in a syllable, such as "p ah", "b ah" and "d uh".

Triphone -

A trio of phonemes in a syllable, such as "p r ah" and "b l ah"

Diphthong -

A sound formed by combining two vowels into a single phoneme, requiring the speaker's mouth to change shape as they pronounce it, for example, the word "Play" uses the diphthong "ey" formed by combining "eh" and "iy".

Romaji -

The transcription of Japanese syllables and words from Japan's text system "Kana" into roman characters, so they can be easily read by speakers of romanised languages, such as English.

Romaji is also used as a phoneme system for Japanese in editors like Synthesizer V.

IPA -

The International Phonetic Alphabet. IPA encapsulates all posible phonemes of all languages and other additional speech qualities.

While no widely used vocalsynths use IPA due to its complexity, it is frequently utilised by enthusiasts and experts to specify and demonstrate a very specific phoneme, language or dialect in a universal context, independent of any specific phoneme system.

Example: "IPA" in IPA is written as "[aɪ pʰiː eɪ]"

X-SAMPA -

Another multi-language encoding system that allows most IPA phonemes to be typeable on an ASCII keyboard, one of the most recent uses of X-SAMPA in vocalsynths is its role as the encoding system for Chinese voicebanks in Synthesizer V and multiple language methods in UTAU.

While X-SAMPA still has its own complexities similar to IPA, it's often praised for its coverage of multiple languages and streamlining IPA into a more useable form.

Example: "X-SAMPA" in X-SAMPA is written as "Eks-sVmpV"

VOCALOID SAMPA -

VOCALOID's own proprietary encoding system loosely based on X-SAMPA, this system is used across VOCALOID banks of all languages.

Example: "VOCALOID SAMPA" in VOCALOID SAMPA is written as "v@UkVlOId sVmpV"

ARPABET -

An English-exclusive encoding system designed to represent phonemes in the General American English dialect, ARPABET sees frequent use in modern day vsynths, such as being the encoding system for English voicebanks in Synthesizer V and CeVIO, presumably because it's a smaller encoding system with simpler codes that are easier to remember, but being designed around one language does make its utility limited.

Example: "ARPABET" in ARPABET is written as "aarpahbeht"

CZ Phonemes -

Also nicknamed "CZSampa".

CZ Phonemes are an English encoding system designed for use in UTAU by community member PaintedCz in their English voicebank method known as "VCCV".

The design philosophy of CZ Phonemes is to design phonemes that look how they sound.

This system is used exclusively with this voicebank type in UTAU and has no mainstream, commercial usage in other mainstream vocalsynths.

Example: "CZ Phonemes" in CZ Phonemes is written as "sEzE fOnEmz"

Production Terms &
File Extensions

Tuning -

The act of manipulating the pitch and cadence of a vocalsynth output to sound more expressive and melodic, two well known forms of tuning include "Notebending", which involves splitting a note into multiple smaller notes to flow into eachother through use of portamento and "Pitchbending", whereby the user manually draws the changes in pitch onto the project to add finer details to the vocals.

Mixing -

The process of compiling all of the aspects of a song or cover into a finished form.

Mixing involves placing all of the music and vocal tracks together and using different effects and plugins so the tracks "blend" into a singular work.

Translyrics -

The act of taking a translation of a song's lyrics into another language and reformatting the translation into a singable form, this process involves rewriting the translation with different words and syllable counts so they flow like conventionally written lyrics while still holding their original thematic meaning.

Plug & Play -

A slang term in community circles that refers to the act of obtaining another user's project file for an editor, swapping the vocalist used, and re-rendering the project while making few or no edits unique to the other user.

This process has utility and is endorsed by some who distribute their project files, but is also prohibited and frowned upon by others, especially if credit isn't provided to the original creator of the file.

Vocal Morphing -

The process of synthesizing two takes by two separate vocalists into one unified take containing the characteristics of both.

The most common use of vocal morphing in vocalsynth spaces is morphing an old standard voicebank with a modern AI voicebank to improve the vocal quality of the former.

This process has lots of utility and sees some niche use, but does occasionally garner controversy with regards to the disclosure of which vocalists are used and whether or not it violates the terms of use of either vocalist.

MIDI -

A file containing note data, this type is used across many different virtual instruments and music interfaces. MIDIs can even be imported into vocal synthesizers to immediately notate a project file in the user's editor of choice.

VOCALOID MIDI -

Project files for VOCALOID 1.

VSQ -

Project files for VOCALOID 2.

VSQX -

Project files for VOCALOID 3 & 4.

VPR -

Project files for VOCALOID 5 & 6.

UST -

Project files for UTAU

USTX -

Project files for OpenUTAU

JSON -

Project files for Synthesizer V R1

S5P -

Project files for Synthesizer V (2018 Production Release)

SVP -

Project files for Synthesizer V Studio

UTAU-Specific Terminology

Flag -

UTAU's parameter system to customise voices, rather than using the GUI like in other editors, flags are added to project files and notes in UTAU using text, such as "g-2MT50P30" (Sets Gender Factor to -2, Tension to 50 and Peak Compressor to 30)

Sampler -

A sampler is a renderer UTAU runs to create its vocals, UTAU allows the user to swap between different samplers to change how the final render sounds.

The default sampler UTAU comes with is called "Resampler" while examples of other samplers include "Freesamp", "TIPS" and "Moresampler" among others.

Voicebank Type/Method -

UTAU was originally built with just Japanese voicebanks in mind, but thanks to the open workflow the software provides for users, there are a wide array of voicebank types and recording methods that allow for different languages and even different outputs of the same language.

Different voicebank types can change not only the lanugage the bank can sing in, but even areas like the length of a recording list, and the complexity of recording, configuring and using the bank.

Reclist -

Recording Lists are the scripts voicers read from when recording concatenative/standard banks for UTAU, they often contain a list of characters, phonetic clusters or even full words to read out in order to capture the required samples for a bank.

Oto -

Refers to "oto.ini", the core configuration file in every UTAU voicebank, this file contains numerical values that specify what section of a recording a sample should use in order for a voicebank to function.

A line in an oto consists of the following parameters:

- The recording file the sample draws from.

- The Alias, the name of the sample that's entered into the note.

- The Offset and Cutoff, where the sample begins and ends in the recording.

- The Preutterance, the point in the recording the note starts on, such as the vowel sound.

- The Overlap, the section of the recording that crossfades with any notes before it to blend the samples together.

- The Consonant, the early fraction of the sample that plays once when the sample begins and never loops, everything after the Consonant up until the Cutoff is called the "Loop Region", which repeats on the sample to account for long sustains.

Pitch -

Pitch in UTAU refers to the amount of keys a reclist has been recorded in, UTAU accomplishes its singing synthesis by pitch correcting a recording to the note the sample is placed in, and the further the recording's base note is from the desired note creates more distortion, so multiple pitches are recorded to give a voicebank a wider, more natural sounding range.

Labelling & Training -

The process of creating an AI voicebank from song data. Each section of the song data's waveform is labelled with the phoneme that's being pronounced for the purpose of giving the AI a reference point, the AI then refers to the labels on the song data and "trains" itself so it can gradually mimic the voice on the song data.

CV (Consonant to Vowel) -

The original Japanese method in UTAU where each recording is a single syllable of the Japanese language, such as "か", "ふ" and "そ".

This method is regarded as the easiest to use in UTAU as you just enter each character into each note, but its lack of complexity can make it sound the choppiest or most digitised.

VCV (Vowel to Consonant to Vowel) -

An intermediate Japanese method where each recording is a string of different syllables to produce triphones that can blend together, such as "a か", "i ふ" and "e そ".

This method is the most widely used for smooth and straightforward Japanese synthesis but doesn't allow as much customisation as other methods with regards to pronunciation and delivery.

CVVC (Consonant to Vowel to Vowel to Consonant) -

An advanced Japanese method that contains not just CV samples, but VC samples as well, such as "a k", "i f" and "e s".

This method allows for small, diphonic banks to produce smooth results akin to VCV, although its often regarded as having a steeper learning curve for both bank designers and users.

Alternatively, a user can use a CVVC bank and not implement the VCs, making it behave exactly like a CV bank.

Arpasing -

An English UTAU method that forms English vocals using Arpabet diphones.

Arpasing is particularly well known for utilising a word-based reclist that's short in length compared to other English methods, making the act of producing an Arpasing bank quick and straightforward, although it has some limitations, such as the original reclist not containing full diphone coverage and the lower volume of community resources compared to other methods.

Along with other English methods in UTAU, Arpasing's biggest complication is the absence of a dictionary system, requiring users to enter and pace the diphones manually.

VCCV -

An English UTAU method that forms English vocals using diphones and triphones of CZ Phonemes.

VCCV is often regarded as the most comprehensive form of English in UTAU thanks to its full coverage of the English language, the scale of its phoneme system to capture the nuances of English phonetics and the large volume of support and learning resources, but its scale also makes it a big investment thanks to its larger reclist and oto compared to other English methods and the lack of a dictionary system.