Operation Guides

Voicebank Types

Before you begin:

- This guide was written and coordinated by members of the UTAU community and has no affiliation with the UTAU software itself or its creator.

- This guide was written with Windows 10 users in mind, so its advice may not translate directly to other operating systems or other versions of UTAU, such as UTAU-Synth for macOS.

- This guide covers the original UTAU software by Ameya, released in 2008, and NOT OpenUTAU, the fan-made UTAU alternative. As such, the utility of this resource may vary.

- While the process of operating UTAU is ultimately safe when done correctly, JOEZCafe and other parties involved in JOEZUTAU projects take no responsibility for any incidents, loss or damage to users or property from following these instructions.

What are Voicebank Types?

UTAU is very open and versatile software. As such, no two voicebanks function in exactly the same way: some banks sing in different languages, and some languages have multiple methods of producing virtual singing.

This can affect how it feels to use the bank, how the results sound, and whether the voicebank can read the lyrics contained in the notes of a project file.

There isn't necessarily a default voicebank type that is recommended for all newcomers, as each one has its own level of complexity, so experimentation is recommended to find the right workflow for you.

JOEZUTAU distributes voicebanks created using the following methods:

- Consonant-Vowel Japanese (CV)

- Vowel-Consonant-Vowel Japanese (VCV)

- Consonant-Vowel-Vowel-Consonant Japanese (CVVC)

- Arpasing English

​

As such, we will be covering them here!

​

CV Japanese

CV is regarded as the most common voicebank type for newcomers with the most straightforward workflow.

CV Japanese UTAU banks give each syllable in the Japanese language its own individual recording. This is how the bank type gets its name: every sample is either a standalone vowel sound or a consonant followed by a vowel.

1.PNG

CV is considered the ideal voicebank type for beginners thanks to its biggest strength: less visual clutter.

Making UST files with CV is as simple as entering a single Hiragana character into a note to use that character's exclusive recording. The tradeoff is that, compared to more complex voicebank methods, CV can often sound choppier and more robotic, as it doesn't use in-between samples that blend notes together. Smoothing out and improving the results can therefore be more time-consuming, depending on the results the user desires.

2.PNG

VCV Japanese

VCV is the most popular voicebank method for intermediate users as it can quickly and easily produce smooth results once the user is comfortable and familiar with UTAU's workflow and interface.

​

VCV uses recordings of a series of Japanese syllables spoken one after another. These produce Japanese CV syllables that also have a Vowel pronounced before the transition to the Consonant. This vowel sound overlaps with the previous note in the sequence to blend the notes into each other, providing smoother pronunciation and delivery in UTAU.

3.PNG

The above diagram shows a recording that produces (at minimum) 7 samples that can be used in UTAU:

[- ka], [a ka], [a ki], [i ka], [a ku], [u ke], [e ka]
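
As a rough illustration of how those sample names come about, here is a minimal Python sketch. It assumes the recording is the syllable string "ka ka ki ka ku ke ka" (inferred from the seven samples listed, not read from the diagram itself), and `vcv_aliases` is a hypothetical helper, not part of UTAU.

```python
def vcv_aliases(syllables):
    """Pair each romanised CV syllable with the vowel that precedes it."""
    aliases = []
    previous_vowel = "-"  # a hyphen marks a sequence start with no overlap vowel
    for syllable in syllables:
        aliases.append(f"{previous_vowel} {syllable}")
        # Assumption: the last letter of a romanised CV syllable is its vowel.
        previous_vowel = syllable[-1]
    return aliases


# Assumed recording string, reconstructed from the sample list above.
recording = ["ka", "ka", "ki", "ka", "ku", "ke", "ka"]
print(vcv_aliases(recording))
# ['- ka', 'a ka', 'a ki', 'i ka', 'a ku', 'u ke', 'e ka']
```

The sketch only handles plain consonant-plus-vowel syllables. Sounds that do not end in a vowel would need extra rules.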

​

Unlike CV, which involves entering standalone characters in Kana, VCV also requires entering these overlap vowels using Roman characters (or entering a hyphen ( - ) when starting a sequence without an overlap vowel) to accomplish smoother Japanese vocals.

4.PNG

VCV Japanese can be used by beginners once the initial learning curve of entering the lyrics correctly has been overcome, but some users will find CV Japanese a more accessible way to learn the interface at a gradual pace.

VCV's primary weakness mostly affects Intermediate and Advanced users: because the vowel overlaps are "baked into" the samples, pronunciation and delivery are harder to customise beyond the initial result.

Ultimately, VCV is a "one and done" method of Japanese that quickly and easily produces smooth results at the cost of less vocal customisation for advanced users.

CVVC Japanese

CVVC is a blend of CV and VCV that allows beginners to operate banks using the traditional CV method, while providing additional, optional samples that advanced users can overlap to smooth the output.

CVVC uses recordings of Japanese syllables that share the same Consonant sound, spoken one after the other, to produce two types of samples:

CVs - Samples of known Japanese syllables. These are named with their respective characters in Hiragana and operate the same way as in a CV voicebank.

VCs - Separate, individual samples of the voice provider holding a Vowel sound and then transitioning to a Consonant. These are Overlap Samples that can optionally be placed between notes to smooth the output, and they are always encoded in Roman lettering.

6.PNG

To summarise, CVVC voicebanks primarily operate the same way as CV voicebanks, by entering Hiragana into notes. The user also has the option of manually placing VC samples in Roman lettering to build a result that sounds similar to VCV from scratch, with the ability to readjust, retime and space out the VCs to make custom pronunciations and deliveries.
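
To make the idea of placing VC samples between CV notes more concrete, here is a small Python sketch. The `KANA_PARTS` table, the `add_vc_notes` helper and the "vowel, space, consonant" naming of VC samples (such as "a k") are all assumptions made for illustration. Individual voicebanks may name their VC samples differently, so check the bank's own documentation.

```python
# Hypothetical lookup table: Hiragana -> (consonant, vowel) in Roman letters.
# Only a handful of characters are included for the example.
KANA_PARTS = {
    "あ": ("", "a"),
    "か": ("k", "a"),
    "き": ("k", "i"),
    "さ": ("s", "a"),
    "て": ("t", "e"),
}


def add_vc_notes(lyrics):
    """Insert an optional VC note between each pair of CV notes."""
    result = []
    for current, following in zip(lyrics, lyrics[1:] + [None]):
        result.append(current)
        if following is None:
            break  # last note: nothing to transition into
        vowel = KANA_PARTS[current][1]
        consonant = KANA_PARTS[following][0]
        if consonant:  # vowel-only notes need no VC transition here
            result.append(f"{vowel} {consonant}")
    return result


print(add_vc_notes(["さ", "か", "あ", "き"]))
# ['さ', 'a k', 'か', 'あ', 'a k', 'き']
```

Because the VC notes are separate entries in the sequence, each one can be retimed or removed on its own, which is the flexibility described above.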

5.PNG

Newcomers to UTAU can absolutely use CVVC voicebanks, even if only as standard CV voicebanks, but using the additional features and samples to make them sound as smooth as VCV does take some intermediate know-how of the software.

Arpasing English

The UTAU software was made with the Japanese language in mind, but through use of its tools and workflow, synthesizing other languages is possible using a range of methods designed by a variety of group efforts in the UTAU community.

Arpasing (a portmanteau of ARPABET and Sing) is one of many methods of making English vocals in UTAU, and it is the dominant English method used in the voicebanks distributed on JOEZUTAU.

The voicebank type uses ARPABET, a form of English phonetic transcription that assigns each consonant and vowel in the General American English dialect its own uniquely identifying code. Users may recognise ARPABET as the phonetic system used with English voicebanks in the vocal synthesizers Synthesizer V and CeVIO.

​

7.PNG

Arpasing uses recordings of a series of English words and syllables spoken one after another to produce a variety of diphones (two phonemes used together, like "k aa" or "ao f").
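
As a minimal sketch of what a diphone is, the Python snippet below pairs up neighbouring phonemes in a sequence. The `diphones` helper is hypothetical, the phoneme sequence is just an example, and any markers a voicebank may use for the start or end of a phrase are left out.

```python
def diphones(phonemes):
    """Join each phoneme with the one that follows it."""
    return [f"{first} {second}" for first, second in zip(phonemes, phonemes[1:])]


# Example ARPABET-style phoneme sequence (lowercase, as in the samples above).
print(diphones(["k", "ao", "f"]))
# ['k ao', 'ao f']
```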

​

8.PNG

The biggest obstacle of Arpasing (and all methods of English in UTAU) is the lack of a dictionary system.

In vocal synthesizers, a dictionary system is a database that reads English words entered into notes to automatically retrieve the needed phonemes and space them accordingly.
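
For readers unfamiliar with the concept, the Python sketch below shows roughly what such a lookup does. The two-word `PRONUNCIATIONS` table and the `lookup` helper are made up for illustration and do not reflect how any particular synthesizer implements its dictionary.

```python
# Hypothetical miniature dictionary: English word -> ARPABET-style phonemes.
PRONUNCIATIONS = {
    "cough": ["k", "ao", "f"],
    "car": ["k", "aa", "r"],
}


def lookup(word):
    """Return the phonemes for a word, or raise if the word is unknown."""
    key = word.lower()
    if key not in PRONUNCIATIONS:
        raise KeyError(f"no dictionary entry for {word!r}")
    return PRONUNCIATIONS[key]


print(lookup("Cough"))
# ['k', 'ao', 'f']
```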

Due to the lack of a dictionary, using English voicebanks in UTAU requires manually entering phonetic diphones, then formatting and spacing those diphones by hand to produce satisfying results, similar to CVVC Japanese.

9.PNG

Arpasing is only recommended for Intermediate and Advanced users of UTAU who are confident with its interface, although it does offer a series of tools and methods that effectively streamline its workflow after some practice and commitment.

Guides for all of these voicebank types are available on the tutorials homepage, but before those, you should get started on how to make a base UST!
