15 Commits

Author SHA1 Message Date
M Clark 9e212125b1 insert takes 2 arguments
In Python 3.4 and 2.7 I get the error "insert takes 2 arguments". This PR fixes it, but I had to assume that an append was intended.
2016-10-20 08:27:40 +08:00
Tim Mahrt 1b1903bc0b BUGFIX: Dental and Stop special keys don't match multichar sounds like tʃ
So 't' will be matched but not 'tʃ'
2016-07-20 16:35:44 +02:00
Tim Mahrt 5e64deebe6 BUGFIX: Removed diacritics from strings while searching
Unless the user is explicitly searching for the diacritic.

Also, added some more documentation.
2016-07-18 17:10:22 +02:00
Tim Mahrt bce3c8ff23 BUGFIX: Protect () and [] in searches 2016-07-18 17:08:02 +02:00
Tim Mahrt 4056b105c9 BUGFIX: Monophthong searches no longer match diphthongs
This was in the code but the functionality didn't work.
2016-07-18 17:06:50 +02:00
Tim Mahrt d88ff7d8d9 FEATURE: Updated to new isledict format. Now using unicode IPA
It made the code a little more complex and now the system
is less typing friendly but is more intuitive (no more guessing
how to pronounce a character).

Update includes changes to documentation.
2016-07-16 00:49:45 +02:00
Tim Mahrt 4cc4bf85ec DOCUMENTATION: Formatting fix 2016-07-09 17:26:47 +02:00
Tim Mahrt 4c1a26ed03 DOCUMENTATION: Ready for release v1.4 2016-07-09 17:23:44 +02:00
Tim Mahrt ac8643678b REFACTOR: Names follow pep008 2016-07-09 17:23:18 +02:00
Tim Mahrt b76454f626 FEATURE: Added searching of words by pronunciation
Based on regular expressions with some keyword parameters
to simplify search queries.  A set of examples is
provided.
2016-07-09 17:22:12 +02:00
timmahrt ea0bc5c5cd BUGFIX: Unicode error in installation file python 2.x 2016-07-08 00:19:35 +02:00
timmahrt 5d70367bfc BUGFIX: OS X default encoding with io.open is whack
I'm changing everything to utf-8
2016-06-29 23:58:38 +02:00
Tim Mahrt 81257bdfaf BUGFIX: Brought example file up-to-date with code 2016-06-28 20:27:35 +02:00
Tim Mahrt 88f79d63e8 BUGFIX: 'U' mode deprecated in open in favor of io.open
'U' is universal line support mode, which is the default mode
in io.open
2016-06-28 20:23:29 +02:00
Tim Mahrt 2dcb92217d BUGFIX: ASCII installation files with unicode caused PIP problems 2016-03-22 12:21:21 +01:00
8 changed files with 340 additions and 27 deletions
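
The two encoding commits above (88f79d63e8 and 5d70367bfc) can be illustrated with a minimal sketch. The file name and the two dictionary rows below are made up for illustration and are not taken from ISLEdict.txt; the point is only that io.open applies universal newline handling by default and that passing the encoding explicitly avoids relying on the platform default.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical two-row file written with Windows line endings
path = os.path.join(tempfile.mkdtemp(), "sample_dict.txt")
with io.open(path, "wb") as fd:
    fd.write(u"cat k ˈæ t\r\ndog d ˈɔ g\r\n".encode("utf-8"))

# In text mode io.open translates \r\n to \n on read (no 'U' flag needed),
# and encoding='utf-8' keeps the result independent of the OS default
with io.open(path, "r", encoding="utf-8") as fd:
    rows = [line.rstrip("\n") for line in fd]

print(rows)
```

Reading the same bytes without an explicit encoding would depend on the locale of the machine, which is presumably what the OS X commit ran into.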
+25 -10
@@ -31,7 +31,7 @@ What can you do with this library?
- map an actual pronunciation to a dictionary pronunciation (can be used - map an actual pronunciation to a dictionary pronunciation (can be used
to automatically find speech errors):: to automatically find speech errors)::
pysle.pronunciationtools.findClosestPronunciation(isleDict, 'cat', ['kh', 'ae',]) pysle.pronunciationtools.findClosestPronunciation(isleDict, 'cat', ['k', 'æ',])
- automatically syllabify a praat textgrid containing words and phones - automatically syllabify a praat textgrid containing words and phones
(e.g. force-aligned text) -- requires my (e.g. force-aligned text) -- requires my
@@ -39,10 +39,23 @@ What can you do with this library?
pysle.syllabifyTextgrid(isleDict, praatioTextgrid, "words", "phones") pysle.syllabifyTextgrid(isleDict, praatioTextgrid, "words", "phones")
- search for words based on pronunciation::
e.g. Words that start with a sound, or have a sound word medially, or
in stressed vowel position, etc.
see /tests/dictionary_search.py
Major revisions Major revisions
================ ================
Ver 1.4 (July 9, 2016)
- added search functionality
- ported code to use the new unicode IPA-based isledict
(the old one was ascii)
Ver 1.3 (March 15, 2016) Ver 1.3 (March 15, 2016)
- added indicies for stressed vowels - added indicies for stressed vowels
@@ -64,12 +77,14 @@ Requirements
================ ================
- Before you use this library (before or after installing it) you will need - Before you use this library (before or after installing it) you will need
to download the ILSEX dictionary. It can be downloaded here: to download the ILSEX dictionary. It can be downloaded here under the
section 'English' linked under the text 'English Pronlex'
(with a file name of ISLEdict.txt):
`ISLEX project page <http://www.isle.illinois.edu/sst/data/dict/>`_ `ISLEX project page <http://isle.illinois.edu/sst/data/g2ps/>`_
`Direct link to the ISLEX file used in this project `Direct link to the ISLEX file used in this project
<http://www.isle.illinois.edu/sst/data/dict/islex/islev2.txt>`_ (islev2.txt) <http://isle.illinois.edu/sst/data/g2ps/English/ISLEdict.txt>`_ (ISLEdict.txt)
- ``Python 2.7.*`` or above - ``Python 2.7.*`` or above
@@ -103,7 +118,7 @@ Here is a typical common usage::
from pysle import isle from pysle import isle
isleDict = isle.LexicalTool('C:\islev2.dict') isleDict = isle.LexicalTool('C:\islev2.dict')
print isleDict.lookup('catatonic')[0] # Get the first pronunciation print isleDict.lookup('catatonic')[0] # Get the first pronunciation
>> [['kh', '@,'], ['t_(', '&'], ['th', "A'"], ['n', 'I', 'kh']] [2] >> [['k', 'ˌæ'], ['t˺', 'ə'], ['t', 'ˈɑ'], ['n', 'ɪ', 'k']] [2, 0]
and another:: and another::
@@ -111,7 +126,7 @@ and another::
from psyle import pronunciationTools from psyle import pronunciationTools
searchWord = 'another' searchWord = 'another'
anotherPhoneList = ['n', '@', 'th', 'r'] # Actually produced anotherPhoneList = ['n', '@', 'th', 'r'] # Actually produced (ASCII or IPA ok here)
returnList = pronunciationTools.findBestSyllabification(isleDict, returnList = pronunciationTools.findBestSyllabification(isleDict,
searchWord, searchWord,
@@ -128,7 +143,7 @@ Citing pysle
Pysle is general purpose coding and doesn't need to be cited Pysle is general purpose coding and doesn't need to be cited
(you should cite the (you should cite the
`ISLEX project <http://www.isle.illinois.edu/sst/data/dict/islex/index.shtml>`_ `ISLEX project <http://isle.illinois.edu/sst/data/g2ps/>`_
instead) but if you would like to, it can be cited like so: instead) but if you would like to, it can be cited like so:
Tim Mahrt. Pysle. https://github.com/timmahrt/pysle, 2016. Tim Mahrt. Pysle. https://github.com/timmahrt/pysle, 2016.
@@ -139,7 +154,7 @@ Acknowledgements
Development of Pysle was possible thanks to NSF grant **IIS 07-03624** Development of Pysle was possible thanks to NSF grant **IIS 07-03624**
to Jennifer Cole and Mark Hasegawa-Johnson, NSF grant **BCS 12-51343** to Jennifer Cole and Mark Hasegawa-Johnson, NSF grant **BCS 12-51343**
to Jennifer Cole, José Hualde, and Caroline Smith, and to Jennifer Cole, José Hualde, and Caroline Smith, and
to the A*MIDEX project (n° **ANR-11-IDEX-0001-02**) to James Sneed German to the A*MIDEX project (n° **ANR-11-IDEX-0001-02**) to James Sneed German
funded by the Investissements dAvenir French Government program, managed funded by the Investissements d'Avenir French Government program, managed
by the French National Research Agency (ANR). by the French National Research Agency (ANR).
+237 -6
@@ -1,11 +1,33 @@
#encoding: utf-8
''' '''
Created on Oct 11, 2012 Created on Oct 11, 2012
@author: timmahrt @author: timmahrt
''' '''
import io
import re
vowelList = ['a', '@', 'e', 'i', 'o', 'u', '^', '&', '>', ]
charList = [u'#', u'.', u'aʊ', u'b', u'd', u'dʒ', u'ei', u'f', u'g',
u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'oʊ', u'p',
u'r', u's', u't', u'tʃ', u'u', u'v', u'w', u'z', u'æ',
u'ð', u'ŋ', u'ɑ', u'ɑɪ', u'ɔ', u'ɔi', u'ə', u'ɚ', u'ɛ', u'ɝ',
u'ɪ', u'ɵ', u'ɹ', u'ʃ', u'ʊ', u'ʒ', u'æ', u'ʌ', ]
diacriticList = [u'˺', u'ˌ', u'̩', u'̃', ]
vowelList = [u'aʊ', u'ei', u'i', u'oʊ', u'u', u'æ',
u'ɑ', u'ɑɪ', u'ɔ', u'ɔi', u'ə', u'ɚ', u'ɛ', u'ɝ',
u'ɪ', u'ʊ', u'ʌ', ]
def isVowel(char):
return any([vowel in char for vowel in vowelList])
def sequenceMatch(matchChar, searchStr):
return matchChar in searchStr
class WordNotInISLE(Exception): class WordNotInISLE(Exception):
@@ -30,7 +52,8 @@ class LexicalTool():
Builds the isle textfile into a dictionary for fast searching Builds the isle textfile into a dictionary for fast searching
''' '''
lexDict = {} lexDict = {}
wordList = [line.rstrip('\n') for line in open(self.islePath, "rU")] with io.open(self.islePath, "r", encoding='utf-8') as fd:
wordList = [line.rstrip('\n') for line in fd]
for row in wordList: for row in wordList:
word, pronunciation = row.split(" ", 1) word, pronunciation = row.split(" ", 1)
@@ -60,6 +83,214 @@ class LexicalTool():
return pronList return pronList
def search(self, matchStr, numSyllables=None, wordInitial='ok',
wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
multiword='ok'):
return search(self.data.items(), matchStr, numSyllables=numSyllables,
wordInitial=wordInitial, wordFinal=wordFinal,
spanSyllable=spanSyllable,
stressedSyllable=stressedSyllable,
multiword=multiword)
def _prepRESearchStr(matchStr, wordInitial='ok', wordFinal='ok',
spanSyllable='ok', stressedSyllable='ok'):
'''
Prepares a user's RE string for a search
'''
# Protect sounds that are two characters
# After this we can assume that each character represents a sound
# (We'll revert back when we're done processing the RE)
replList = [(u'ei', u'9'), (u'tʃ', u'='), (u'oʊ', u'~'),
(u'dʒ', u'@'), (u'aʊ', u'%'), (u'ɑɪ', u'&'),
(u'ɔi', u'$')]
# Add to the replList
currentReplNum = 0
startI = 0
for left, right in (('(', ')'), ('[', ']')):
while True:
try:
i = matchStr.index(left, startI)
except ValueError:
break
j = matchStr.index(right, i) + 1
replList.append((matchStr[i:j], str(currentReplNum)))
currentReplNum += 1
startI = j
for charA, charB in replList:
matchStr = matchStr.replace(charA, charB)
# Characters to check between all other characters
# Don't check between all other characters if the character is already
# in the search string or
interleaveStr = None
stressOpt = (stressedSyllable == 'ok' or stressedSyllable == 'only')
spanOpt = (spanSyllable == 'ok' or spanSyllable == 'only')
if stressOpt and spanOpt:
interleaveStr = u"\.?ˈ?"
elif stressOpt:
interleaveStr = u"ˈ?"
elif spanOpt:
interleaveStr = u"\.?"
if interleaveStr is not None:
matchStr = interleaveStr.join(matchStr)
# Setting search boundaries
# We search on '[^\.#]' and not '.' so that the search doesn't span
# multiple syllables or words
if wordInitial == 'only':
matchStr = u'#' + matchStr
elif wordInitial == 'no':
# Match the closest preceding syllable. If there is none, look
# for word boundary plus at least one other character
matchStr = u'(?:\.[^\.#]*?|#[^\.#]+?)' + matchStr
else:
matchStr = u'[#\.][^\.#]*?' + matchStr
if wordFinal == 'only':
matchStr = matchStr + u'#'
elif wordFinal == 'no':
matchStr = matchStr + u"(?:[^\.#]*?\.|[^\.#]+?#)"
else:
matchStr = matchStr + u'[^\.#]*?[#\.]'
# For sounds that are designated two characters, prevent
# detecting those sounds if the user wanted a sound
# designated by one of the contained characters
# Forward search ('a' and not 'ab')
insertList = []
for charA, charB in [(u'e', u'i'), (u't', u'ʃ'), (u'd', u'ʒ'),
(u'o', u'ʊ'), (u'a', u'ʊ|ɪ'), (u'ɔ', u'i'), ]:
startI = 0
while True:
try:
i = matchStr.index(charA, startI)
except ValueError:
break
if matchStr[i + 1] != charB:
forwardStr = u'(?!%s)' % charB
# matchStr = matchStr[:i + 1] + forwardStr + matchStr[i + 1:]
startI = i + 1 + len(forwardStr)
insertList.append((i + 1, forwardStr))
# Backward search ('b' and not 'ab')
for charA, charB in [(u't', u'ʃ'), (u'd', u'ʒ'),
(u'a|o', u'ʊ'), (u'e|ɔ', u'i'), (u'ɑ', u'ɪ'), ]:
startI = 0
while True:
try:
i = matchStr.index(charB, startI)
except ValueError:
break
if matchStr[i - 1] != charA:
backStr = u'(?<!%s)' % charA
# matchStr = matchStr[:i] + backStr + matchStr[i:]
startI = i + 1 + len(backStr)
insertList.append((i, backStr))
insertList.sort()
for i, insertStr in insertList[::-1]:
matchStr = matchStr[:i] + insertStr + matchStr[i:]
# Revert the special sounds back from 1 character to 2 characters
for charA, charB in replList:
matchStr = matchStr.replace(charB, charA)
# Replace special characters
replDict = {"D": u"(?:t(?!ʃ)|d(?!ʒ)|[sz])", # dentals
"F": u"[ʃʒfvszɵðh]", # fricatives
"S": u"(?:t(?!ʃ)|d(?!ʒ)|[pbkg])", # stops
"N": u"[nmŋ]", # nasals
"R": u"[rɝɚ]", # rhotics
"V": u"(?:aʊ|ei|oʊ|ɑɪ|ɔi|[iuæɑɔəɛɪʊʌ]):?", # vowels
"B": u"\.", # syllable boundary
}
for char, replStr in replDict.items():
matchStr = matchStr.replace(char, replStr)
return matchStr
def search(searchList, matchStr, numSyllables=None, wordInitial='ok',
wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
multiword='ok'):
'''
Searches for matching words in the dictionary with regular expressions
wordInitial, wordFinal, spanSyllable, stressedSyllable, and multiword
can take three different values: 'ok', 'only', or 'no'.
Special search characters:
'D' - any dental; 'F' - any fricative; 'S' - any stop
'V' - any vowel; 'N' - any nasal; 'R' - any rhotic
'#' - word boundary
'B' - syllable boundary
'.' - anything
For advanced queries:
Regular expression syntax applies, so if you wanted to search for any
word ending with a vowel or rhotic, matchStr = '(?:V|R)#', '[VR]#', etc.
'''
# Run search for words
matchStr = _prepRESearchStr(matchStr, wordInitial, wordFinal,
spanSyllable, stressedSyllable)
compiledRE = re.compile(matchStr)
retList = []
for word, pronList in searchList:
newPronList = []
for pron in pronList:
searchPron = pron.replace(",", "").replace(" ", "")
# Ignore diacritics for now:
for diacritic in diacriticList:
if diacritic not in matchStr:
searchPron = searchPron.replace(diacritic, "")
if numSyllables is not None:
if numSyllables != searchPron.count('.') + 1:
continue
# Is this a compound word?
if multiword == 'only':
if searchPron.count('#') == 2:
continue
elif multiword == 'no':
if searchPron.count('#') > 2:
continue
matchList = compiledRE.findall(searchPron)
if len(matchList) > 0:
if stressedSyllable == 'only':
if all([u"ˈ" not in match for match in matchList]):
continue
if stressedSyllable == 'no':
if all([u"ˈ" in match for match in matchList]):
continue
# For syllable spanning, we check if there is a syllable
# marker inside (not at the border) of the match.
if spanSyllable == 'only':
if all(["." not in txt[1:-1] for txt in matchList]):
continue
if spanSyllable == 'no':
if all(["." in txt[1:-1] for txt in matchList]):
continue
newPronList.append(pron)
if len(newPronList) > 0:
retList.append((word, newPronList))
retList.sort()
return retList
def _parsePronunciation(pronunciationStr): def _parsePronunciation(pronunciationStr):
''' '''
@@ -76,13 +307,13 @@ def _parsePronunciation(pronunciationStr):
stressedPhoneList = [] stressedPhoneList = []
for i, syllable in enumerate(syllableList): for i, syllable in enumerate(syllableList):
for j, phone in enumerate(syllable): for j, phone in enumerate(syllable):
if "'" in phone: if u"ˈ" in phone:
stressedSyllableList.insert(0, i) stressedSyllableList.insert(0, i)
stressedPhoneList.insert(0, j) stressedPhoneList.insert(0, j)
break break
elif '"' in phone: elif u'ˌ' in phone:
stressedSyllableList.insert(i) stressedSyllableList.append(i)
stressedPhoneList.insert(j) stressedPhoneList.append(j)
return syllableList, stressedSyllableList, stressedPhoneList return syllableList, stressedSyllableList, stressedPhoneList
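
The change in this hunk can be checked in isolation. The sketch below is a simplified stand-in for the stress-collecting loop, not the library function itself (collect_stress is a hypothetical name); it uses the 'catatonic' syllable list shown in the README diff. Primary stress is inserted at the front of each list and secondary stress is appended, which is why the one-argument insert calls had to become append.

```python
# -*- coding: utf-8 -*-

def collect_stress(syllable_list):
    """Return (syllable indices, phone indices), primary stress first."""
    stressed_syllables = []
    stressed_phones = []
    for i, syllable in enumerate(syllable_list):
        for j, phone in enumerate(syllable):
            if u"ˈ" in phone:
                # Primary stress always goes to the front
                stressed_syllables.insert(0, i)
                stressed_phones.insert(0, j)
                break
            elif u"ˌ" in phone:
                # Secondary stress follows in order of appearance
                stressed_syllables.append(i)
                stressed_phones.append(j)
    return stressed_syllables, stressed_phones


catatonic = [[u"k", u"ˌæ"], [u"t˺", u"ə"], [u"t", u"ˈɑ"], [u"n", u"ɪ", u"k"]]
syllables, phones = collect_stress(catatonic)
print(syllables, phones)

# The old call form fails because list.insert needs an index and a value
try:
    [].insert(0)
    old_call_failed = False
except TypeError:
    old_call_failed = True
```

The syllable indices come out as [2, 0], matching the second value printed in the updated README example.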
+2 -1
@@ -1,3 +1,4 @@
#encoding: utf-8
''' '''
Created on Oct 22, 2014 Created on Oct 22, 2014
@@ -76,7 +77,7 @@ def syllabifyTextgrid(isleDict, tg, wordTierName, phoneTierName,
stressJ = None # stressJ = None #
if stressI is not None: if stressI is not None:
syllableList[stressI][stressJ] += "'" syllableList[stressI][stressJ] += u"ˈ"
i = 0 i = 0
# print(syllableList) # print(syllableList)
+2 -1
@@ -1,3 +1,4 @@
#encoding: utf-8
''' '''
Created on Oct 15, 2014 Created on Oct 15, 2014
@@ -151,7 +152,7 @@ def _findBestPronunciation(isleDict, wordText, aPron):
hasStress = False hasStress = False
for syllable in syllableList: for syllable in syllableList:
for phone in syllable: for phone in syllable:
hasStress = "'" in phone or hasStress hasStress = u"ˈ" in phone or hasStress
if hasStress: if hasStress:
withStress.append(i) withStress.append(i)
+5 -2
@@ -1,16 +1,19 @@
#!/usr/bin/env python
# encoding: utf-8
''' '''
Created on Oct 15, 2014 Created on Oct 15, 2014
@author: tmahrt @author: tmahrt
''' '''
import codecs
from distutils.core import setup from distutils.core import setup
setup(name='pysle', setup(name='pysle',
version='1.3.0', version='1.4.0',
author='Tim Mahrt', author='Tim Mahrt',
author_email='timmahrt@gmail.com', author_email='timmahrt@gmail.com',
package_dir={'pysle':'pysle'}, package_dir={'pysle':'pysle'},
packages=['pysle'], packages=['pysle'],
license='LICENSE', license='LICENSE',
long_description=open('README.rst', 'r').read(), long_description=codecs.open('README.rst', 'r', encoding="utf-8").read(),
# install_requires=[], # No requirements! # requires 'from setuptools import setup' # install_requires=[], # No requirements! # requires 'from setuptools import setup'
) )
@@ -1,3 +1,4 @@
#encoding: utf-8
''' '''
Created on Oct 22, 2014 Created on Oct 22, 2014
@@ -12,16 +13,18 @@ from pysle import pronunciationtools
# In this first example we look up the syllabification of a word and get it's # In this first example we look up the syllabification of a word and get it's
# stress information. # stress information.
searchWord = 'pumpkins' searchWord = 'catatonic'
isleDict = isletool.LexicalTool('islev2.txt') isleDict = isletool.LexicalTool('ISLEdict.txt')
lookupResults = isleDict.lookup(searchWord) lookupResults = isleDict.lookup(searchWord)
firstEntry = lookupResults[0] firstEntry = lookupResults[0]
firstSyllableList = firstEntry[0] firstSyllableList = firstEntry[0]
firstSyllableList = ".".join([u" ".join(syllable) for syllable in firstSyllableList])
firstStressList = firstEntry[1] firstStressList = firstEntry[1]
print(searchWord) print(searchWord)
print(firstSyllableList, firstStressList) # 3rd syllable carries stress print(firstSyllableList)
print(firstStressList) # 3rd syllable carries stress
# Here we determine the syllabification of a word, as it was said. # Here we determine the syllabification of a word, as it was said.
@@ -35,10 +38,14 @@ returnList = pronunciationtools.findBestSyllabification(isleDict,
searchWord, searchWord,
anotherPhoneList) anotherPhoneList)
stressedSyllable, syllableList, syllabification, stressedIndex = returnList (stressedSyllable, syllableList, syllabification,
stressedSyllableIndexList, stressedPhoneIndexList,
flattenedStressIndexList) = returnList
print(searchWord) print(searchWord)
print(anotherPhoneList) print(anotherPhoneList)
print(syllableList) # We can see the first syllable was elided print(stressedSyllableIndexList) # We can see the first syllable was elided
print(stressedPhoneIndexList)
print(flattenedStressIndexList)
print(syllableList)
print(syllabification)
+55
@@ -0,0 +1,55 @@
#encoding: utf-8
'''
Created on July 08, 2016
@author: tmahrt
Basic examples of common usage.
'''
import random
from pysle import isletool
tmpPath = r"C:\Users\Tim\Dropbox\workspace\pysle\test\ISLEdict.txt"
isleDict = isletool.LexicalTool(tmpPath)
def printOutMatches(matchStr, numSyllables=None, wordInitial='ok',
wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
multiword='ok', numMatches=None, matchList=None):
if matchList is None:
matchList = isleDict.search(matchStr, numSyllables, wordInitial,
wordFinal, spanSyllable, stressedSyllable,
multiword)
else:
matchList = isletool.search(matchList, matchStr, numSyllables, wordInitial,
wordFinal, spanSyllable, stressedSyllable,
multiword)
if numMatches is not None and len(matchList) > numMatches:
random.shuffle(matchList)
for i, matchTuple in enumerate(matchList):
if numMatches is not None and i >= numMatches:
break
word, pronList = matchTuple
print("%s: %s" % (word, ",".join(pronList)))
print("")
return matchList
# 2-syllable words with a stressed syllable containing 'dV' but not word initially
printOutMatches("dV", stressedSyllable="only", spanSyllable="no",
wordInitial="no", numSyllables=2, numMatches=10)
# 3-syllable word with an 'ld' sequence that spans a syllable boundary
printOutMatches("lBd", wordInitial="no", multiword='no',
numSyllables=3, numMatches=10)
# words ending in 'inth'
matchList = printOutMatches(u"ɪ", wordFinal="only", numMatches=10)
# that also start with 's'
matchList = printOutMatches("s", wordInitial="only", numMatches=10,
matchList=matchList, multiword="no")
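
Two of the search fixes in this change set can be tried without the dictionary file. The sketch below is a stand-alone illustration, not the library's search function: the pronunciation strings are hypothetical and only imitate the '#' (word boundary) and '.' (syllable boundary) layout used by the search code, and strip_diacritics is a made-up helper name. The dental pattern and the diacritic list are copied from the diff above.

```python
# -*- coding: utf-8 -*-
import re

# Dental class from replDict: 't' and 'd' are rejected when they are the
# first half of an affricate ('tʃ', 'dʒ')
DENTAL = u"(?:t(?!ʃ)|d(?!ʒ)|[sz])"
word_initial_dental = re.compile(u"#" + DENTAL)

# Hypothetical pronunciation strings, for illustration only
top = u"#tˈɑp#"
chop = u"#tʃˈɑp#"
top_matches = word_initial_dental.search(top) is not None
chop_matches = word_initial_dental.search(chop) is not None

# Diacritics from diacriticList (the last two are combining characters)
DIACRITICS = [u"˺", u"ˌ", u"\u0329", u"\u0303"]


def strip_diacritics(pron, match_str):
    """Drop each diacritic unless the query itself contains it."""
    for diacritic in DIACRITICS:
        if diacritic not in match_str:
            pron = pron.replace(diacritic, u"")
    return pron


plain = strip_diacritics(u"#kˌæ.t˺ə#", u"t")
explicit = strip_diacritics(u"#kˌæ.t˺ə#", u"t˺")
print(top_matches, chop_matches, plain, explicit)
```

The negative lookahead is what keeps 'D' and 'S' from matching the first character of a multi-character sound, and the conditional stripping keeps a query such as 't˺' usable while ordinary queries ignore diacritics.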