Saying goodbye to MIDI-A/B modes in this forked repository #67
Closed · yqzhishen announced in Announcements · 1 comment · 1 reply
- Decoupling from the phoneme durations and f0 predictor sounds really sensible.
We will perform lots of clean-ups and refactoring in the next version, with breaking changes described in the following release:
https://github.com/openvpi/DiffSinger/releases/tag/v1.7.1
Why remove the MIDI modes?
MIDI modes do have the ability to predict phoneme durations (MIDI-A and MIDI-B) and pitch (MIDI-A) from music score inputs (so-called auto-tuning by some people). However, MIDI modes also have many disadvantages.
What is the recommended mode now?
We recommend that all users move to MIDI-less mode at this point. This will be the standard mode of the DiffSinger acoustic model in the future, with all pipelines and facilities (engines, editors) focusing on it. Its datasets are easy to label and build with the provided tools and pipelines, and it offers better performance and controllability than MIDI-A/B. Users can also enjoy the latest features, such as multi-speaker models, dynamic speaker mix, data augmentation, and gender and velocity control, in MIDI-less mode.
Will MIDI modes be completely deleted?
No, at least not for now. There is one disadvantage of MIDI-less mode: it cannot predict phoneme durations on its own. The work-around is rhythmizers (FastSpeech2Encoder + DurationPredictor), which are taken from the MIDI-A mode. The code of MIDI-A/B will be kept in the release above, and a branch for MIDI-A/B will be kept even as the main branch advances. We will keep MIDI-A/B (although without maintenance) in this repository until we finish developing better alternatives and the corresponding customized pipelines that let everyone easily prepare and train their own duration/f0/... models.
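For readers unfamiliar with that structure, below is a minimal PyTorch sketch of the general rhythmizer idea: an encoder over phoneme tokens followed by a small duration-predictor head. This is not the project's actual FastSpeech2Encoder/DurationPredictor code; the class name, layer sizes, and toy usage are all illustrative assumptions.

```python
# Minimal sketch of a rhythmizer-style module (illustrative only, not the
# DiffSinger implementation): encode phoneme tokens, then predict a
# log-duration per phoneme.
import torch
import torch.nn as nn


class SketchRhythmizer(nn.Module):  # hypothetical name
    def __init__(self, vocab_size=64, hidden=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Duration predictor head: maps each encoded phoneme to a scalar.
        self.duration_predictor = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, phoneme_ids):
        # phoneme_ids: (batch, n_phonemes) integer tokens
        h = self.encoder(self.embed(phoneme_ids))
        return self.duration_predictor(h).squeeze(-1)  # (batch, n_phonemes)


# Toy usage: convert predicted log-durations into whole frame counts.
model = SketchRhythmizer()
phonemes = torch.randint(0, 64, (1, 10))
frames = torch.clamp(torch.exp(model(phonemes)).round().long(), min=1)
```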
Will MIDI-related stuff come back in the future?
Absolutely yes! Our goal is to let users simply input music scores and lyrics to generate nice singing voices, so of course we must deal with MIDI.
However, we will limit the usage of MIDI inputs: they will only be used to predict phoneme durations, f0 and other parameters. These components will be called variance adaptors or variance models, and their outputs can be directly consumed by the current MIDI-less acoustic models. With this decoupled cascade architecture, we can achieve higher flexibility, quality and controllability.
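As a rough illustration of that cascade, here is a minimal sketch under assumed, simplified tensor interfaces (the real DiffSinger interfaces differ): a hypothetical variance model turns the score into phoneme durations and an f0 curve, and a MIDI-less acoustic model then consumes only phonemes, durations and f0, never the MIDI score itself.

```python
# Sketch of the decoupled cascade: score -> variance model -> acoustic model.
# Both functions are hypothetical placeholders with made-up interfaces.
import torch


def variance_model(phoneme_ids, midi_pitches):
    """Predict per-phoneme durations (frames) and a frame-level f0 curve (Hz)
    from the music score. Placeholder outputs only."""
    n = phoneme_ids.shape[-1]
    durations = torch.full((n,), 12, dtype=torch.long)   # dummy: 12 frames each
    f0 = 220.0 * torch.ones(int(durations.sum()))        # dummy: flat 220 Hz
    return durations, f0


def acoustic_model(phoneme_ids, durations, f0):
    """A MIDI-less acoustic model never sees the score, only phonemes,
    durations and f0. Placeholder output only."""
    n_frames = int(durations.sum())
    return torch.zeros(n_frames, 128)                    # dummy mel frames


# The cascade wiring: the variance model's outputs feed the acoustic model.
phonemes = torch.randint(0, 64, (10,))
midi = torch.randint(48, 72, (10,))
durations, f0 = variance_model(phonemes, midi)
mel = acoustic_model(phonemes, durations, f0)
```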