PhonemeConverter is a C++ library for converting pronunciation strings into phoneme sequences and marking phoneme onsets. It provides simple built-in converters, file-driven converters, JSON rule-based onset marking, and LuaJIT-backed scripting hooks for custom behavior.

Features:
- Split space-separated pronunciation strings into phoneme sequences.
- Convert phonemes through tab-separated mapping files.
- Convert whole pronunciation keys through tab-separated dictionary files.
- Mark phoneme onsets with JSON pattern rules.
- Implement custom conversion and onset marking logic with Lua scripts.
- Build as a shared library by default, or as a static library with a CMake option.

Requirements:
- CMake 3.17 or newer
- A C++ compiler with C++20 support for the library build
- PkgConfig
- LuaJIT
- nlohmann/json
- stdcorelib
- GoogleTest, only when building tests
DirectS2P splits a pronunciation string by spaces.
#include <PhonemeConverter/DirectS2P.h>
#include <string>
#include <vector>
std::vector<std::string> phonemes = PhonemeConverter::DirectS2P::convert("AA BB CC");
// {"AA", "BB", "CC"}

MappingS2P splits a pronunciation string by spaces and applies a tab-separated phoneme mapping. Known phonemes are replaced, while unknown phonemes are kept unchanged.
#include <PhonemeConverter/MappingS2P.h>
#include <sstream>
#include <string>
#include <vector>
std::istringstream mappingFile("AA\ta\nBB\tb\n");
PhonemeConverter::MappingS2P converter(mappingFile);
std::vector<std::string> phonemes = converter.convert("AA CC BB");
// {"a", "CC", "b"}

Mapping file format:
AA a
BB b
Each non-empty line must contain exactly one tab, with a non-empty source phoneme and a non-empty target phoneme. Duplicate source phonemes are rejected.
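
The format rules above can be illustrated with a small standalone sketch. It does not use or reproduce the library's code: `parseMapping` and `convertWithMapping` are hypothetical helper names, and `std::runtime_error` stands in for whatever exception the library actually throws.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: reads "source<TAB>target" lines into a lookup table.
std::unordered_map<std::string, std::string> parseMapping(std::istream &in) {
    std::unordered_map<std::string, std::string> table;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) {
            continue; // empty lines carry no mapping
        }
        const std::size_t tab = line.find('\t');
        // Exactly one tab: it must exist, and no second tab may follow it.
        if (tab == std::string::npos ||
            line.find('\t', tab + 1) != std::string::npos) {
            throw std::runtime_error("expected exactly one tab: " + line);
        }
        std::string source = line.substr(0, tab);
        std::string target = line.substr(tab + 1);
        if (source.empty() || target.empty()) {
            throw std::runtime_error("empty source or target: " + line);
        }
        // emplace reports false when the source phoneme is already present.
        if (!table.emplace(std::move(source), std::move(target)).second) {
            throw std::runtime_error("duplicate source phoneme: " + line);
        }
    }
    return table;
}

// Hypothetical helper: splits on whitespace, replaces known phonemes,
// and keeps unknown phonemes unchanged.
std::vector<std::string> convertWithMapping(
    const std::unordered_map<std::string, std::string> &table,
    const std::string &pronunciation) {
    std::vector<std::string> result;
    std::istringstream words(pronunciation);
    std::string phoneme;
    while (words >> phoneme) {
        const auto it = table.find(phoneme);
        result.push_back(it == table.end() ? phoneme : it->second);
    }
    return result;
}
```

Feeding this sketch the two-line mapping shown earlier and the input "AA CC BB" yields {"a", "CC", "b"}, matching the MappingS2P example.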
DictionaryS2P maps a whole pronunciation key to a configured phoneme sequence. Unknown keys convert to an empty sequence.
#include <PhonemeConverter/DictionaryS2P.h>
#include <sstream>
#include <string>
#include <vector>
std::istringstream dictionaryFile("hello\tHH AH L OW\nnihao\tn i h ao\n");
PhonemeConverter::DictionaryS2P converter(dictionaryFile);
std::vector<std::string> phonemes = converter.convert("hello");
// {"HH", "AH", "L", "OW"}

Dictionary file format:
hello HH AH L OW
nihao n i h ao
Each non-empty line must contain exactly one tab, with a non-empty key and a space-separated phoneme sequence. Duplicate keys and empty phonemes inside a sequence are rejected.
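
As a rough illustration of these rules, the sketch below parses a dictionary stream and looks up keys. It is a standalone approximation, not the library's implementation: `splitSequence`, `parseDictionary`, and `lookup` are hypothetical names, `std::runtime_error` stands in for the library's own exception, and rejecting a completely empty sequence is an assumption of this sketch.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using PhonemeSequence = std::vector<std::string>;

// Hypothetical helper: splits on single spaces. Leading, trailing, or doubled
// spaces produce an empty piece, which is rejected. An entirely empty input is
// rejected too (an assumption of this sketch).
PhonemeSequence splitSequence(const std::string &text) {
    PhonemeSequence phonemes;
    std::size_t start = 0;
    while (true) {
        const std::size_t space = text.find(' ', start);
        const std::size_t length =
            space == std::string::npos ? std::string::npos : space - start;
        std::string phoneme = text.substr(start, length);
        if (phoneme.empty()) {
            throw std::runtime_error("empty phoneme in sequence: " + text);
        }
        phonemes.push_back(std::move(phoneme));
        if (space == std::string::npos) {
            break;
        }
        start = space + 1;
    }
    return phonemes;
}

// Hypothetical helper: reads "key<TAB>phoneme phoneme ..." lines.
std::unordered_map<std::string, PhonemeSequence> parseDictionary(std::istream &in) {
    std::unordered_map<std::string, PhonemeSequence> dictionary;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) {
            continue; // empty lines carry no entry
        }
        const std::size_t tab = line.find('\t');
        if (tab == std::string::npos ||
            line.find('\t', tab + 1) != std::string::npos) {
            throw std::runtime_error("expected exactly one tab: " + line);
        }
        std::string key = line.substr(0, tab);
        if (key.empty()) {
            throw std::runtime_error("empty key: " + line);
        }
        PhonemeSequence sequence = splitSequence(line.substr(tab + 1));
        if (!dictionary.emplace(std::move(key), std::move(sequence)).second) {
            throw std::runtime_error("duplicate key: " + line);
        }
    }
    return dictionary;
}

// Unknown keys yield an empty sequence rather than an error.
PhonemeSequence lookup(
    const std::unordered_map<std::string, PhonemeSequence> &dictionary,
    const std::string &key) {
    const auto it = dictionary.find(key);
    return it == dictionary.end() ? PhonemeSequence{} : it->second;
}
```

Returning an empty sequence for a missing key mirrors the documented DictionaryS2P behavior, so callers can distinguish "not found" from malformed input, which is reported at load time instead.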
LuaS2P executes a Lua function named s2p. The function receives the pronunciation string and must return a table of strings.
function s2p(pronunciation)
local result = {}
for phoneme in string.gmatch(pronunciation, "[^ ]+") do
result[#result + 1] = string.lower(phoneme)
end
return result
end

#include <PhonemeConverter/LuaS2P.h>
#include <PhonemeConverter/LuaScript.h>
#include <string>
#include <vector>
PhonemeConverter::LuaScript script(/* Lua script */, "s2p-script");
PhonemeConverter::LuaS2P converter(script);
std::vector<std::string> phonemes = converter.convert("AA BB CC");
// {"aa", "bb", "cc"}

RuleOnsetMarker loads JSON rule definitions. A definition contains phoneme type assignments and ordered pattern rules.
{
"phonemeTypes": {
"ae": "vowel",
"ah": "vowel",
"ey": "vowel",
"ow": "vowel",
"b": "consonant",
"f": "consonant",
"k": "consonant",
"l": "liquid",
"r": "liquid",
"y": "liquid"
},
"rules": [
{ "pattern": ["vowel"], "onsets": [0] },
{ "pattern": ["consonant", "liquid", "vowel"], "onsets": [1] },
{ "pattern": ["liquid", "liquid", "vowel"], "onsets": [1] }
]
}

#include <PhonemeConverter/RuleOnsetMarker.h>
#include <sstream>
#include <string>
#include <vector>
std::istringstream rules(/* JSON rules */);
PhonemeConverter::RuleOnsetMarker marker(rules);
std::vector<bool> onsets = marker.mark({"b", "r", "ih", "l", "y", "ax", "n", "t"}); // brilliant
// {false, true, false, false, true, false, false, false}

Rules use phoneme type names in pattern. The wildcard pattern item "*" matches any phoneme. When multiple rules can match, the marker chooses the longest matching rule and prefers typed rules over wildcard rules of the same length. For example:
{
"phonemeTypes": {},
"rules": [
{ "pattern": ["*"], "onsets": [0] },
{ "pattern": ["*", "*"], "onsets": [1] }
]
}

LuaOnsetMarker executes a Lua function named markonset. The function receives a table of phoneme strings and must return a table of booleans with the same length.
function markonset(phonemes)
local result = {}
for i = 1, #phonemes do
result[i] = i == 1 or phonemes[i] == "T"
end
return result
end

#include <PhonemeConverter/LuaOnsetMarker.h>
#include <PhonemeConverter/LuaScript.h>
#include <string>
#include <vector>
PhonemeConverter::LuaScript script(/* Lua script */, "onset-script");
PhonemeConverter::LuaOnsetMarker marker(script);
std::vector<bool> onsets = marker.mark({"S", "AA", "T"});
// {true, false, true}

The library reports invalid input formats and Lua failures with typed exceptions:
- MappingS2PParseError
- DictionaryS2PParseError
- RuleOnsetMarkerParseError
- LuaScriptError
- LuaS2PError
- LuaOnsetMarkerError

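
Typed exceptions let callers handle each failure kind separately. The sketch below shows one possible catch structure using a stand-in type. Everything in the `sketch` namespace is hypothetical: it only mirrors the name `MappingS2PParseError`, and it assumes the exception derives from `std::exception`, which this README does not state. `checkMappingLine` and `loadOrReport` are invented for illustration.

```cpp
#include <stdexcept>
#include <string>

namespace sketch {

// Hypothetical stand-in for the library's exception type. The real base class
// and constructor are assumptions, not documented facts.
class MappingS2PParseError : public std::runtime_error {
public:
    using std::runtime_error::runtime_error;
};

// Toy validator standing in for code that parses a mapping file.
void checkMappingLine(const std::string &line) {
    if (line.find('\t') == std::string::npos) {
        throw MappingS2PParseError("missing tab in line: " + line);
    }
}

} // namespace sketch

// Catch the specific type first, then fall back to a generic handler.
std::string loadOrReport(const std::string &line) {
    try {
        sketch::checkMappingLine(line);
        return "ok";
    } catch (const sketch::MappingS2PParseError &error) {
        return std::string("bad mapping file: ") + error.what();
    } catch (const std::exception &error) {
        return std::string("unexpected error: ") + error.what();
    }
}
```

Ordering the handlers from most to least specific keeps a malformed-input report distinct from an unrelated failure.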
PhonemeConverter is licensed under the Apache License 2.0. See LICENSE for details.