Hello Craig,
Thank you for posting on Microsoft Learn.
For short clips (<15s each), you need to choose Individual utterances + matching transcript. You upload one ZIP of .wav files and one .txt transcript file (not zipped). https://learn.microsoft.com/en-us/azure/ai-services/speech-service/professional-voice-create-training-set
For long recordings, choose Long audio + transcript. You upload one ZIP of audio and a second ZIP of per-file .txt transcripts (each .txt has the same filename as its audio). https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-voice-training-data
If you’re using Individual utterances + matching transcript:
Audio ZIP: put only .wav files at the root (no subfolders) and give each a unique filename with Windows-safe characters (none of \ / : * ? " < > |, and no leading or trailing space). https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-voice-training-data
Each file must be PCM WAV, at least 16 kHz (24 kHz recommended), 16-bit, and shorter than 15 seconds. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-voice-training-data
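If it helps, here is a minimal sketch for checking a file against the audio requirements above before you zip it. It uses only Python's standard `wave` module; the function name `check_wav` and the constants are my own, not part of any Azure tooling, and the service may apply further checks that this does not cover.

```python
import wave

MIN_SAMPLE_RATE = 16_000  # Hz, minimum stated above (24 kHz recommended)
MAX_DURATION_S = 15.0     # seconds, per-utterance limit stated above


def check_wav(path):
    """Return a list of problems found in one .wav file (empty list = OK).

    Hypothetical helper for local pre-checks only.
    """
    try:
        with wave.open(path, "rb") as w:
            rate = w.getframerate()
            width = w.getsampwidth()   # bytes per sample; 2 means 16-bit
            frames = w.getnframes()
    except (wave.Error, EOFError) as exc:
        # The wave module only reads uncompressed PCM, so this also
        # catches compressed or otherwise non-PCM WAV files.
        return [f"not a readable PCM WAV: {exc}"]

    problems = []
    if rate < MIN_SAMPLE_RATE:
        problems.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE} Hz")
    if width != 2:
        problems.append(f"sample width is {width * 8}-bit, expected 16-bit")
    if rate > 0 and frames / rate >= MAX_DURATION_S:
        problems.append(f"duration {frames / rate:.1f}s is not under {MAX_DURATION_S:.0f}s")
    return problems
```

You could run it over every file in your folder and fix anything it reports before uploading.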
Transcript .txt (a single file): one line per utterance, in the form audio ID, tab, text:
0000000001<TAB>This is the waistline, and it's falling.
0000000002<TAB>We have trouble scoring.
0000000003<TAB>It was Janet Maslin.
You must use a real tab character between the audio ID and the text (spaces or commas will fail). https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-voice-training-data
The audio ID must match the .wav filename (usually the base name, without .wav) and is expected to be numeric. Duplicates are rejected. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/professional-voice-create-training-set
The encodings allowed are UTF-8/UTF-8-BOM/UTF-16-LE/UTF-16-BE/ANSI/ASCII (zh-CN doesn’t allow ANSI/ASCII). https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-voice-training-data
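To catch the most common transcript mistakes before uploading, a small check along these lines may be useful. It is only a sketch: `check_transcript` is a name I made up, and it assumes the audio ID is the .wav filename without its extension, as described above. It does not check encoding or the numeric-ID expectation.

```python
import os


def check_transcript(lines, wav_filenames):
    """Return problems found in transcript lines (empty list = OK).

    lines: the lines of the transcript .txt file.
    wav_filenames: the .wav filenames that will go in the audio ZIP.
    Hypothetical helper; assumes ID == filename without ".wav".
    """
    wav_ids = {os.path.splitext(name)[0] for name in wav_filenames}
    seen = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first real tab; spaces or commas do not count.
        audio_id, sep, text = line.partition("\t")
        if not sep or not text.strip():
            problems.append(f"line {number}: expected ID<TAB>text")
            continue
        if audio_id in seen:
            problems.append(f"line {number}: duplicate ID {audio_id}")
        seen.add(audio_id)
        if audio_id not in wav_ids:
            problems.append(f"line {number}: no matching .wav for ID {audio_id}")
    # Report audio files that never appear in the transcript.
    for missing in sorted(wav_ids - seen):
        problems.append(f"{missing}.wav has no transcript line")
    return problems
```

For example, you could read the transcript with `open("transcript.txt", encoding="utf-8")`, pass its lines together with `os.listdir()` of your audio folder, and print whatever comes back. The file and folder names here are placeholders.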