VoxSort Diarization: who spoke when?

Improves the display and playback of voice conversations through speaker diarization.

  • VoxSort Diarization can recognize and separate up to 7 speakers who talk in turn;
  • Detects non-speech sounds;
  • Accepts voice recordings in most common formats;
  • Extracts sound from video files;
  • VoxSort Diarization is a standalone application: no cloud computing is used;
  • Speaker diarization error rate is typically better than 2% for near studio quality recordings;
  • Evaluation time is typically 350+ times shorter than the duration of the sound file;
  • User-friendly graphic representation of the voice conversation;
  • Simple, intuitive user interface.


VoxSort Diarization is a tool for improving the playback and management of recorded voice dialogues. It relies on a unique, very fast and accurate speaker diarization technology. The software splits the sound file into segments (paragraphs) of speech produced by each participant in the conversation. A user-friendly graphic representation then simplifies navigation through the sound and lets you play it back in a set of modes. No speech-to-text technology is used.

Everybody who uses voice recording knows the problem: it is not easy to find the moment of a dialogue you want to recall, and in most cases you need to listen through the whole recording. So very often a dictaphone does not deliver what it seemingly should. VoxSort Diarization helps by quickly showing the structure of the conversation before the tedious process of expensive manual, or error-prone automated, full transcription of speech to speakers and text.

Although the program was designed primarily for live dictaphone recordings, it can be used for a wide range of input: recorded telephone or teleconference conversations, sound tracks of video clips, podcasts of radio and TV shows, etc. A set of audio extracting/converting software modules is included to handle these inputs.

VoxSort Diarization appears to be, for the time being, the only end-user product of its kind on the market. Known state-of-the-art speaker diarization methods produce results not much faster than real time, i.e. a modern PC needs several minutes at best to evaluate a voice conversation tens of minutes long. An application based on such methods can hardly be positioned as a simple consumer product for express analysis. VoxSort Diarization, while providing competitive accuracy, typically does the job in several seconds for a half-hour sound file and almost immediately for recordings of several minutes. More on VoxSort technology:


The processing pipeline assumes that the input sound is a genuine record of a single conversation, i.e. the environment (background noise, microphone frequency response, position of speakers) does not change suddenly. A mixture of records from different sources, or inserted stretches of zero sound, can cause an enormous error rate, up to a complete failure.

VoxSort Diarization is intended for "near studio quality" speech records. This means that the conversation was recorded in a quiet environment with no echo and no background music, the sound volume is properly set, and the speech of each participant is balanced in volume and quality.

The conversation is expected to be disciplined, i.e. all speakers talk calmly in normal voices and do not overlap each other. Short replies (< 3 seconds) are a common problem for all diarization methods; if most of the conversation consists of short replies, VoxSort Diarization will fail to provide a reasonable result.

Like any recognition system (even a human one), VoxSort Diarization sometimes makes errors. Define the error rate (ER) as the total duration of all wrong assignments (wrong speaker, non-speech taken for speech, or vice versa) relative to the duration of the whole sound file. If all the above requirements are met, VoxSort Diarization typically provides an ER better than 2%.
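The ER definition above can be illustrated with a minimal sketch. It assumes that both the reference and the automatic result are given as lists of (start, end, label) segments in seconds, that time not covered by any segment counts as non-speech, and that speaker labels in the two lists already correspond to each other. The names `Segment`, `label_at` and `error_rate` are hypothetical and are not part of VoxSort Diarization.

```python
from typing import List, Optional, Tuple

# (start_s, end_s, label); time not covered by any segment is non-speech (None)
Segment = Tuple[float, float, Optional[str]]

def label_at(segments: List[Segment], t: float) -> Optional[str]:
    """Return the label covering time t, or None if no segment covers it."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return None

def error_rate(reference: List[Segment], hypothesis: List[Segment],
               total_s: float) -> float:
    """Duration of mismatching labels divided by the whole file duration."""
    # Labels can only change at segment boundaries, so it is enough to
    # compare the two labellings once inside every interval between cuts.
    cuts = {0.0, total_s}
    for start, end, _ in reference + hypothesis:
        cuts.update((start, end))
    points = sorted(c for c in cuts if 0.0 <= c <= total_s)

    wrong = 0.0
    for left, right in zip(points, points[1:]):
        mid = (left + right) / 2.0
        if label_at(reference, mid) != label_at(hypothesis, mid):
            wrong += right - left
    return wrong / total_s

# Illustrative data only: a 100-second file with two speakers.
reference = [(0.0, 40.0, "A"), (40.0, 70.0, "B"), (70.0, 100.0, "A")]
hypothesis = [(0.0, 41.0, "A"), (41.0, 70.0, "B"), (70.0, 99.0, "A")]
er = error_rate(reference, hypothesis, total_s=100.0)
print(f"ER = {er:.1%}")
```

In this made-up example one second is assigned to the wrong speaker and one second of speech is taken for non-speech, so 2 of 100 seconds are wrong.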

"Normal" errors are: missed short replies, spurious short replies inserted, and inaccurate placement of the boundary between speakers. Harder, "abnormal" errors occur when VoxSort Diarization determines the wrong number of conversation participants. If the number of speakers is known, the user can set it manually; in most cases many of those errors will then disappear.

The hardest case for VoxSort Diarization so far is a "sketch-like" conversation with too many speaker turns, a lot of "wow"s, "hey"s and "hmm"s, laughter, shouting and many extra sounds: music, applause, furniture being moved, etc. In such cases VoxSort Diarization may produce a result that has little in common with reality.


Real Time Rate (xRT) is the CPU time consumed by the evaluation relative to the total duration of the sound file. For demo purposes VoxSort Diarization shows a brief computing profile. Typical xRT lies in the range 0.002...0.004, i.e. VoxSort Diarization is 250...500 times faster than real time.
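As a quick check of the arithmetic, the following sketch converts xRT into a speed-up factor and into the expected CPU time for a half-hour file. The function names are hypothetical and serve only as an illustration of the definition above.

```python
def xrt(cpu_seconds: float, audio_seconds: float) -> float:
    """Real Time Rate: CPU time relative to the duration of the sound file."""
    return cpu_seconds / audio_seconds

def speedup(rate: float) -> float:
    """How many times faster than real time a given xRT is."""
    return 1.0 / rate

half_hour_s = 30 * 60
for rate in (0.002, 0.004):
    cpu_s = rate * half_hour_s
    print(f"xRT {rate}: {speedup(rate):.0f}x faster than real time, "
          f"{cpu_s:.1f} s of CPU time for a 30-minute file")
```

At these rates a 30-minute file takes roughly 3.6 to 7.2 seconds of CPU time.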

Below is the result of a speed comparison between VoxSort Diarization and a typical currently available speaker diarization package, LIUM (an open source project). Six random voice conversations were used for the test: panel discussions, an online interview, therapist interviews, etc.

File     Speakers  Duration (min:sec)  VoxSort (s)  LIUM (s)  Ratio
#1       6         12:04               1.88         147       78x
#2       4         28:47               4.80         813       169x
#3       4         05:35               0.76         128       168x
#4       2         06:14               1.01         71        70x
#5       4         05:47               0.84         58        69x
#6       2         13:19               2.10         347       165x
Average                                                       ~120x

The "VoxSort" and "LIUM" columns give the execution time of the respective programs in seconds. The test was performed on the same regular i5 @ 2.4 GHz PC.
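The ratios in the table can be recomputed from the listed times. The sketch below uses only the figures from the table; the variable names are hypothetical. It also derives the xRT of VoxSort Diarization for each file from the duration column.

```python
# (file, duration "min:sec", VoxSort seconds, LIUM seconds), copied from the table
rows = [
    ("#1", "12:04", 1.88, 147),
    ("#2", "28:47", 4.80, 813),
    ("#3", "05:35", 0.76, 128),
    ("#4", "06:14", 1.01, 71),
    ("#5", "05:47", 0.84, 58),
    ("#6", "13:19", 2.10, 347),
]

def to_seconds(min_sec: str) -> int:
    """Convert a "min:sec" string to seconds."""
    minutes, seconds = min_sec.split(":")
    return int(minutes) * 60 + int(seconds)

ratios = []
xrts = []
for name, duration, voxsort_s, lium_s in rows:
    ratio = lium_s / voxsort_s                  # how many times LIUM took longer
    rate = voxsort_s / to_seconds(duration)     # xRT of VoxSort for this file
    ratios.append(ratio)
    xrts.append(rate)
    print(f"{name}: ratio {ratio:.0f}x, VoxSort xRT {rate:.4f}")

average_ratio = sum(ratios) / len(ratios)
print(f"average ratio ~{average_ratio:.0f}x")
```

The per-file xRT values computed this way fall inside the 0.002...0.004 range.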


In the current release of VoxSort Diarization the total duration of the voice dialogue is limited to a minimum of 1 minute and a maximum of 30 minutes. Indeed, sorting a 1-minute record makes no sense: it is easy to listen to all of it. The half-hour limit is an artificial limit for demo purposes; some speed-up improvements need to be implemented before longer records can be accepted. Also, VoxSort Diarization may miss a participant whose total speech amounts to less than 1 minute.

VoxSort Diarization can split a conversation into 7 speakers at most. If there were more participants in the dialogue, the extra ones will be merged with the speakers they most resemble.
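The limits above can be collected into a simple pre-check of a recording. This is only an illustrative sketch: the constants restate the limits from the text, while the function `check_recording` and its inputs are hypothetical and do not describe how VoxSort Diarization itself validates input.

```python
from typing import Dict, List

MIN_DURATION_S = 60.0            # 1 minute
MAX_DURATION_S = 30 * 60.0       # 30 minutes (demo limit)
MAX_SPEAKERS = 7
MIN_SPEECH_PER_SPEAKER_S = 60.0  # a speaker with less speech may be missed

def check_recording(duration_s: float,
                    speech_per_speaker_s: Dict[str, float]) -> List[str]:
    """Return notes about limits the recording is likely to run into."""
    notes: List[str] = []
    if duration_s < MIN_DURATION_S:
        notes.append("too short: under 1 minute")
    elif duration_s > MAX_DURATION_S:
        notes.append("too long: over 30 minutes")
    if len(speech_per_speaker_s) > MAX_SPEAKERS:
        notes.append("more than 7 speakers: extra speakers will be merged")
    for name, seconds in sorted(speech_per_speaker_s.items()):
        if seconds < MIN_SPEECH_PER_SPEAKER_S:
            notes.append(f"{name} may be missed: under 1 minute of speech")
    return notes

# Illustrative data only: a 45-minute talk show with a brief caller.
notes = check_recording(45 * 60.0,
                        {"host": 1500.0, "guest": 1160.0, "caller": 40.0})
for note in notes:
    print(note)
```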


December 15, 2018

Beta VoxSort Diarization version 0.71.1. Minor bugs fixed. Test for macOS available. No functionality and/or GUI changes.

December 5, 2017

Beta VoxSort Diarization version 0.71.0. Minor bugs fixed. Accessibility to external sound files added. No functionality and/or GUI changes.

May 5, 2017

Beta VoxSort Diarization version 0.70.0. Improved both reliability and speed by 30...60 %. No functionality and/or GUI changes.

November 11, 2016

First public beta VoxSort Diarization version 0.61.3


Integrated Wave Technologies, Inc.'s diarization technology is based on the company's highly successful speech recognition algorithm, used extensively in US Government R&D and systems acquisition programs. DARPA, the Air Force Research Lab, the US Department of Justice's Office of Science and Technology, and the Naval Aerospace Medical Research Lab acquired our language analysis/speech recognition technology for use in combat/field operations. These US Government offices funded extensive R&D and acquisition programs because our recognition software provided better performance in situations where exceptional noise immunity, low computing power requirements and high accuracy were required. More about IWT solutions:

For the mobile (Android) version of VoxSort Diarization, please refer to the Google Play Store.

Please send any questions or notes to feedback@voxsort.com.

Copyright (c) Integrated Wave Technologies, Inc. All rights reserved.

