15 Commits

Author SHA1 Message Date
M Clark 9e212125b1 insert takes 2 arguments
In python 3.4 and 2.7 I get an error  "insert takes 2 arguments". This PR fixes it, but I had to assume it was intended to be an append.
2016-10-20 08:27:40 +08:00
Tim Mahrt 1b1903bc0b BUGFIX: Dental and Stop special keys don't match multichar sounds like tʃ
So 't' will be matched but not 'tʃ'
2016-07-20 16:35:44 +02:00
Tim Mahrt 5e64deebe6 BUGFIX: Removed diacritics from strings while searching
Unless the user is explicitly searching for the diacritic.

Also, added some more documentation.
2016-07-18 17:10:22 +02:00
Tim Mahrt bce3c8ff23 BUGFIX: Protect () and [] in searches 2016-07-18 17:08:02 +02:00
Tim Mahrt 4056b105c9 BUGFIX: Monophthong searches no longer match diphthongs
This was in the code but the functionality didn't work.
2016-07-18 17:06:50 +02:00
Tim Mahrt d88ff7d8d9 FEATURE: Updated to new isledict format. Now using unicode IPA
It made the code a little more complex and now the system
is less typing friendly but is more intuitive (no more guessing
how to pronounce a character).

Update includes changes to documentation.
2016-07-16 00:49:45 +02:00
Tim Mahrt 4cc4bf85ec DOCUMENTATION: Formatting fix 2016-07-09 17:26:47 +02:00
Tim Mahrt 4c1a26ed03 DOCUMENTATION: Ready for release v1.4 2016-07-09 17:23:44 +02:00
Tim Mahrt ac8643678b REFACTOR: Names follow pep008 2016-07-09 17:23:18 +02:00
Tim Mahrt b76454f626 FEATURE: Added searching of words by pronunciation
Based on regular expressions with some keyword parameters
to simplify search queries.  A set of examples is
provided.
2016-07-09 17:22:12 +02:00
timmahrt ea0bc5c5cd BUGFIX: Unicode error in installation file python 2.x 2016-07-08 00:19:35 +02:00
timmahrt 5d70367bfc BUGFIX: OS X default encoding with io.open is whack
I'm changing everything to utf-8
2016-06-29 23:58:38 +02:00
Tim Mahrt 81257bdfaf BUGFIX: Brought example file up-to-date with code 2016-06-28 20:27:35 +02:00
Tim Mahrt 88f79d63e8 BUGFIX: 'U' mode deprecated in open in favor of io.open
'U' is universal line support mode, which is the default mode
in io.open
2016-06-28 20:23:29 +02:00
Tim Mahrt 2dcb92217d BUGFIX: ASCII installation files with unicode caused PIP problems 2016-03-22 12:21:21 +01:00
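
The two encoding commits above (88f79d63e8 and 5d70367bfc) replace `open(path, "rU")` with `io.open` and an explicit utf-8 encoding. A minimal sketch of that pattern follows; the helper name `read_lines_utf8` and the two-line sample file are hypothetical and are not taken from the real ISLEdict.txt.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_lines_utf8(path):
    # io.open behaves the same on Python 2.7 and 3.x. Text mode already
    # applies universal-newline handling, so the old "rU" flag is not needed,
    # and the explicit encoding avoids relying on the platform default.
    with io.open(path, "r", encoding="utf-8") as fd:
        return [line.rstrip("\n") for line in fd]


# Hypothetical sample content, used only to exercise the helper
sample = u"cat # k ˈæ t #\ndog # d ˈɔ g #\n"
tmp_path = os.path.join(tempfile.mkdtemp(), "sample_dict.txt")
with io.open(tmp_path, "w", encoding="utf-8") as fd:
    fd.write(sample)

lines = read_lines_utf8(tmp_path)
print(lines)
```

Passing the encoding explicitly means the result does not depend on the operating system's default, which is the problem the OS X commit describes.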
8 changed files with 340 additions and 27 deletions
+25 -10
@@ -31,7 +31,7 @@ What can you do with this library?
- map an actual pronunciation to a dictionary pronunciation (can be used
to automatically find speech errors)::
pysle.pronunciationtools.findClosestPronunciation(isleDict, 'cat', ['kh', 'ae',])
pysle.pronunciationtools.findClosestPronunciation(isleDict, 'cat', ['k', 'æ',])
- automatically syllabify a praat textgrid containing words and phones
(e.g. force-aligned text) -- requires my
@@ -39,10 +39,23 @@ What can you do with this library?
pysle.syllabifyTextgrid(isleDict, praatioTextgrid, "words", "phones")
- search for words based on pronunciation::
e.g. Words that start with a sound, or have a sound word medially, or
in stressed vowel position, etc.
see /tests/dictionary_search.py
Major revisions
================
Ver 1.4 (July 9, 2016)
- added search functionality
- ported code to use the new unicode IPA-based isledict
(the old one was ascii)
Ver 1.3 (March 15, 2016)
- added indices for stressed vowels
@@ -64,12 +77,14 @@ Requirements
================
- Before you use this library (before or after installing it) you will need
to download the ILSEX dictionary. It can be downloaded here:
to download the ISLEX dictionary. It can be downloaded here under the
section 'English' linked under the text 'English Pronlex'
(with a file name of ISLEdict.txt):
`ISLEX project page <http://www.isle.illinois.edu/sst/data/dict/>`_
`ISLEX project page <http://isle.illinois.edu/sst/data/g2ps/>`_
`Direct link to the ISLEX file used in this project
<http://www.isle.illinois.edu/sst/data/dict/islex/islev2.txt>`_ (islev2.txt)
<http://isle.illinois.edu/sst/data/g2ps/English/ISLEdict.txt>`_ (ISLEdict.txt)
- ``Python 2.7.*`` or above
@@ -103,7 +118,7 @@ Here is a typical common usage::
from pysle import isle
isleDict = isle.LexicalTool('C:\islev2.dict')
print isleDict.lookup('catatonic')[0] # Get the first pronunciation
>> [['kh', '@,'], ['t_(', '&'], ['th', "A'"], ['n', 'I', 'kh']] [2]
>> [['k', 'ˌæ'], ['t˺', 'ə'], ['t', 'ˈɑ'], ['n', 'ɪ', 'k']] [2, 0]
and another::
@@ -111,7 +126,7 @@ and another::
from pysle import pronunciationTools
searchWord = 'another'
anotherPhoneList = ['n', '@', 'th', 'r'] # Actually produced
anotherPhoneList = ['n', '@', 'th', 'r'] # Actually produced (ASCII or IPA ok here)
returnList = pronunciationTools.findBestSyllabification(isleDict,
searchWord,
@@ -128,7 +143,7 @@ Citing pysle
Pysle is general-purpose code and doesn't need to be cited
(you should cite the
`ISLEX project <http://www.isle.illinois.edu/sst/data/dict/islex/index.shtml>`_
`ISLEX project <http://isle.illinois.edu/sst/data/g2ps/>`_
instead) but if you would like to, it can be cited like so:
Tim Mahrt. Pysle. https://github.com/timmahrt/pysle, 2016.
@@ -139,7 +154,7 @@ Acknowledgements
Development of Pysle was possible thanks to NSF grant **IIS 07-03624**
to Jennifer Cole and Mark Hasegawa-Johnson, NSF grant **BCS 12-51343**
to Jennifer Cole, José Hualde, and Caroline Smith, and
to the A*MIDEX project (n° **ANR-11-IDEX-0001-02**) to James Sneed German
funded by the Investissements dAvenir French Government program, managed
to Jennifer Cole, José Hualde, and Caroline Smith, and
to the A*MIDEX project (n° **ANR-11-IDEX-0001-02**) to James Sneed German
funded by the Investissements d'Avenir French Government program, managed
by the French National Research Agency (ANR).
+237 -6
@@ -1,11 +1,33 @@
#encoding: utf-8
'''
Created on Oct 11, 2012
@author: timmahrt
'''
import io
import re
vowelList = ['a', '@', 'e', 'i', 'o', 'u', '^', '&', '>', ]
charList = [u'#', u'.', u'aʊ', u'b', u'd', u'dʒ', u'ei', u'f', u'g',
u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'oʊ', u'p',
u'r', u's', u't', u'tʃ', u'u', u'v', u'w', u'z', u'æ',
u'ð', u'ŋ', u'ɑ', u'ɑɪ', u'ɔ', u'ɔi', u'ə', u'ɚ', u'ɛ', u'ɝ',
u'ɪ', u'ɵ', u'ɹ', u'ʃ', u'ʊ', u'ʒ', u'æ', u'ʌ', ]
diacriticList = [u'˺', u'ˌ', u'̩', u'̃', ]
vowelList = [u'aʊ', u'ei', u'i', u'oʊ', u'u', u'æ',
u'ɑ', u'ɑɪ', u'ɔ', u'ɔi', u'ə', u'ɚ', u'ɛ', u'ɝ',
u'ɪ', u'ʊ', u'ʌ', ]
def isVowel(char):
return any([vowel in char for vowel in vowelList])
def sequenceMatch(matchChar, searchStr):
return matchChar in searchStr
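
`isVowel` uses a substring test rather than equality, so a phone that carries a stress mark or diacritic is still recognised as a vowel. A minimal sketch of that behaviour, using an abbreviated vowel list (`vowel_subset` is a hypothetical subset chosen for illustration, not the full `vowelList`):

```python
# -*- coding: utf-8 -*-

# Abbreviated, hypothetical subset of the vowel inventory
vowel_subset = [u"ei", u"i", u"u", u"æ", u"ɑ", u"ə", u"ɪ"]


def is_vowel(phone):
    # Substring membership: u"ˈɑ" contains u"ɑ", so the stress mark
    # does not hide the vowel.
    return any(vowel in phone for vowel in vowel_subset)


for phone in [u"ˈɑ", u"ˌæ", u"k", u"t˺"]:
    print(phone, is_vowel(phone))
```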
class WordNotInISLE(Exception):
@@ -30,7 +52,8 @@ class LexicalTool():
Builds the isle textfile into a dictionary for fast searching
'''
lexDict = {}
wordList = [line.rstrip('\n') for line in open(self.islePath, "rU")]
with io.open(self.islePath, "r", encoding='utf-8') as fd:
wordList = [line.rstrip('\n') for line in fd]
for row in wordList:
word, pronunciation = row.split(" ", 1)
@@ -60,6 +83,214 @@ class LexicalTool():
return pronList
def search(self, matchStr, numSyllables=None, wordInitial='ok',
wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
multiword='ok'):
return search(self.data.items(), matchStr, numSyllables=numSyllables,
wordInitial=wordInitial, wordFinal=wordFinal,
spanSyllable=spanSyllable,
stressedSyllable=stressedSyllable,
multiword=multiword)
def _prepRESearchStr(matchStr, wordInitial='ok', wordFinal='ok',
spanSyllable='ok', stressedSyllable='ok'):
'''
Prepares a user's RE string for a search
'''
# Protect sounds that are two characters
# After this we can assume that each character represents a sound
# (We'll revert back when we're done processing the RE)
replList = [(u'ei', u'9'), (u'tʃ', u'='), (u'oʊ', u'~'),
(u'dʒ', u'@'), (u'aʊ', u'%'), (u'ɑɪ', u'&'),
(u'ɔi', u'$')]
# Add to the replList
currentReplNum = 0
startI = 0
for left, right in (('(', ')'), ('[', ']')):
while True:
try:
i = matchStr.index(left, startI)
except ValueError:
break
j = matchStr.index(right, i) + 1
replList.append((matchStr[i:j], str(currentReplNum)))
currentReplNum += 1
startI = j
for charA, charB in replList:
matchStr = matchStr.replace(charA, charB)
# Characters to check between all other characters
# Don't check between all other characters if the character is already
# in the search string or
interleaveStr = None
stressOpt = (stressedSyllable == 'ok' or stressedSyllable == 'only')
spanOpt = (spanSyllable == 'ok' or spanSyllable == 'only')
if stressOpt and spanOpt:
interleaveStr = u"\.?ˈ?"
elif stressOpt:
interleaveStr = u"ˈ?"
elif spanOpt:
interleaveStr = u"\.?"
if interleaveStr is not None:
matchStr = interleaveStr.join(matchStr)
# Setting search boundaries
# We search on '[^\.#]' and not '.' so that the search doesn't span
# multiple syllables or words
if wordInitial == 'only':
matchStr = u'#' + matchStr
elif wordInitial == 'no':
# Match the closest preceding syllable. If there is none, look
# for word boundary plus at least one other character
matchStr = u'(?:\.[^\.#]*?|#[^\.#]+?)' + matchStr
else:
matchStr = u'[#\.][^\.#]*?' + matchStr
if wordFinal == 'only':
matchStr = matchStr + u'#'
elif wordFinal == 'no':
matchStr = matchStr + u"(?:[^\.#]*?\.|[^\.#]+?#)"
else:
matchStr = matchStr + u'[^\.#]*?[#\.]'
# For sounds that are designated two characters, prevent
# detecting those sounds if the user wanted a sound
# designated by one of the contained characters
# Forward search ('a' and not 'ab')
insertList = []
for charA, charB in [(u'e', u'i'), (u't', u'ʃ'), (u'd', u'ʒ'),
(u'o', u'ʊ'), (u'a', u'ʊ|ɪ'), (u'ɔ', u'i'), ]:
startI = 0
while True:
try:
i = matchStr.index(charA, startI)
except ValueError:
break
if matchStr[i + 1] != charB:
forwardStr = u'(?!%s)' % charB
# matchStr = matchStr[:i + 1] + forwardStr + matchStr[i + 1:]
startI = i + 1 + len(forwardStr)
insertList.append((i + 1, forwardStr))
# Backward search ('b' and not 'ab')
for charA, charB in [(u't', u'ʃ'), (u'd', u'ʒ'),
(u'a|o', u'ʊ'), (u'e|ɔ', u'i'), (u'ɑ', u'ɪ'), ]:
startI = 0
while True:
try:
i = matchStr.index(charB, startI)
except ValueError:
break
if matchStr[i - 1] != charA:
backStr = u'(?<!%s)' % charA
# matchStr = matchStr[:i] + backStr + matchStr[i:]
startI = i + 1 + len(backStr)
insertList.append((i, backStr))
insertList.sort()
for i, insertStr in insertList[::-1]:
matchStr = matchStr[:i] + insertStr + matchStr[i:]
# Revert the special sounds back from 1 character to 2 characters
for charA, charB in replList:
matchStr = matchStr.replace(charB, charA)
# Replace special characters
replDict = {"D": u"(?:t(?!ʃ)|d(?!ʒ)|[sz])", # dentals
"F": u"[ʃʒfvszɵðh]", # fricatives
"S": u"(?:t(?!ʃ)|d(?!ʒ)|[pbkg])", # stops
"N": u"[nmŋ]", # nasals
"R": u"[rɝɚ]", # rhotics
"V": u"(?:aʊ|ei|oʊ|ɑɪ|ɔi|[iuæɑɔəɛɪʊʌ]):?", # vowels
"B": u"\.", # syllable boundary
}
for char, replStr in replDict.items():
matchStr = matchStr.replace(char, replStr)
return matchStr
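
The lookahead and lookbehind guards built above keep a single-character sound from matching inside a two-character sound. A minimal standalone sketch of the same idea, using the `t(?!ʃ)|d(?!ʒ)` fragment from `replDict`; the helper `contains` and the sample pronunciations are hypothetical.

```python
# -*- coding: utf-8 -*-
import re

# A plain t or d: the negative lookaheads reject the affricates tʃ and dʒ
plain_stop = re.compile(u"(?:t(?!ʃ)|d(?!ʒ))")
# A bare ʃ: the negative lookbehind rejects the ʃ that is part of tʃ
bare_sh = re.compile(u"(?<!t)ʃ")


def contains(pattern, pron):
    return pattern.search(pron) is not None


# Hypothetical pronunciations, written without syllable or word markers
for pron in [u"kæt", u"tʃɪn", u"dʒɑb", u"ʃɪp"]:
    print(pron, contains(plain_stop, pron), contains(bare_sh, pron))
```

Only `kæt` contains a plain stop under this pattern, and only `ʃɪp` contains a bare `ʃ`.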
def search(searchList, matchStr, numSyllables=None, wordInitial='ok',
wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
multiword='ok'):
'''
Searches for matching words in the dictionary with regular expressions
wordInitial, wordFinal, spanSyllable, stressedSyllable, and multiword
can take three different values: 'ok', 'only', or 'no'.
Special search characters:
'D' - any dental; 'F' - any fricative; 'S' - any stop
'V' - any vowel; 'N' - any nasal; 'R' - any rhotic
'#' - word boundary
'B' - syllable boundary
'.' - anything
For advanced queries:
Regular expression syntax applies, so if you wanted to search for any
word ending with a vowel or rhotic, matchStr = '(?:V|R)#'.
'''
# Run search for words
matchStr = _prepRESearchStr(matchStr, wordInitial, wordFinal,
spanSyllable, stressedSyllable)
compiledRE = re.compile(matchStr)
retList = []
for word, pronList in searchList:
newPronList = []
for pron in pronList:
searchPron = pron.replace(",", "").replace(" ", "")
# Ignore diacritics for now:
for diacritic in diacriticList:
if diacritic not in matchStr:
searchPron = searchPron.replace(diacritic, "")
if numSyllables is not None:
if numSyllables != searchPron.count('.') + 1:
continue
# Is this a compound word?
if multiword == 'only':
if searchPron.count('#') == 2:
continue
elif multiword == 'no':
if searchPron.count('#') > 2:
continue
matchList = compiledRE.findall(searchPron)
if len(matchList) > 0:
if stressedSyllable == 'only':
if all([u"ˈ" not in match for match in matchList]):
continue
if stressedSyllable == 'no':
if all([u"ˈ" in match for match in matchList]):
continue
# For syllable spanning, we check if there is a syllable
# marker inside (not at the border) of the match.
if spanSyllable == 'only':
if all(["." not in txt[1:-1] for txt in matchList]):
continue
if spanSyllable == 'no':
if all(["." in txt[1:-1] for txt in matchList]):
continue
newPronList.append(pron)
if len(newPronList) > 0:
retList.append((word, newPronList))
retList.sort()
return retList
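
Before matching, `search` strips commas and spaces from each pronunciation, drops any diacritic the query does not mention, and counts syllables from the `.` separators. A minimal sketch of that preprocessing in isolation; the function names and the sample pronunciation string are assumptions for illustration, and the real dictionary's line format may differ.

```python
# -*- coding: utf-8 -*-

# Same four marks as diacriticList; the last two are combining characters
# (syllabic mark U+0329 and tilde U+0303), written as escapes for readability.
diacritics = [u"˺", u"ˌ", u"\u0329", u"\u0303"]


def normalize(pron, match_str):
    # Remove separators, then drop every diacritic the query does not use
    search_pron = pron.replace(u",", u"").replace(u" ", u"")
    for diacritic in diacritics:
        if diacritic not in match_str:
            search_pron = search_pron.replace(diacritic, u"")
    return search_pron


def count_syllables(search_pron):
    # Syllables are separated by '.', so n separators mean n + 1 syllables
    return search_pron.count(u".") + 1


# Hypothetical pronunciation string for 'catatonic'
pron = u"# k ˌæ . t˺ ə . t ˈɑ . n ɪ k #"
plain = normalize(pron, u"tə")
explicit = normalize(pron, u"t˺ə")
print(plain, count_syllables(plain))
print(explicit)
```

Note that the primary stress mark `ˈ` is not in the diacritic list, so it survives normalization and remains available to the stressed-syllable filter.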
def _parsePronunciation(pronunciationStr):
'''
@@ -76,13 +307,13 @@ def _parsePronunciation(pronunciationStr):
stressedPhoneList = []
for i, syllable in enumerate(syllableList):
for j, phone in enumerate(syllable):
if "'" in phone:
if u"ˈ" in phone:
stressedSyllableList.insert(0, i)
stressedPhoneList.insert(0, j)
break
elif '"' in phone:
stressedSyllableList.insert(i)
stressedPhoneList.insert(j)
elif u'ˌ' in phone:
stressedSyllableList.append(i)
stressedPhoneList.append(j)
return syllableList, stressedSyllableList, stressedPhoneList
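
The change above replaces the one-argument `insert(i)` calls, which raise a `TypeError`, with `append`. A minimal standalone sketch of the resulting ordering rule: primary stress goes to the front of each list and secondary stress is appended. The function name `find_stress` is hypothetical; the input is the 'catatonic' syllabification shown in the README hunk.

```python
# -*- coding: utf-8 -*-


def find_stress(syllable_list):
    stressed_syllables = []
    stressed_phones = []
    for i, syllable in enumerate(syllable_list):
        for j, phone in enumerate(syllable):
            if u"ˈ" in phone:
                # Primary stress always comes first in the output
                stressed_syllables.insert(0, i)
                stressed_phones.insert(0, j)
                break
            elif u"ˌ" in phone:
                # Secondary stress is appended in order of appearance
                stressed_syllables.append(i)
                stressed_phones.append(j)
    return stressed_syllables, stressed_phones


catatonic = [[u"k", u"ˌæ"], [u"t˺", u"ə"], [u"t", u"ˈɑ"], [u"n", u"ɪ", u"k"]]
syllables, phones = find_stress(catatonic)
print(syllables, phones)  # [2, 0] [1, 1]
```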
+2 -1
@@ -1,3 +1,4 @@
#encoding: utf-8
'''
Created on Oct 22, 2014
@@ -76,7 +77,7 @@ def syllabifyTextgrid(isleDict, tg, wordTierName, phoneTierName,
stressJ = None #
if stressI is not None:
syllableList[stressI][stressJ] += "'"
syllableList[stressI][stressJ] += u"ˈ"
i = 0
# print(syllableList)
+2 -1
@@ -1,3 +1,4 @@
#encoding: utf-8
'''
Created on Oct 15, 2014
@@ -151,7 +152,7 @@ def _findBestPronunciation(isleDict, wordText, aPron):
hasStress = False
for syllable in syllableList:
for phone in syllable:
hasStress = "'" in phone or hasStress
hasStress = u"ˈ" in phone or hasStress
if hasStress:
withStress.append(i)
+5 -2
@@ -1,16 +1,19 @@
#!/usr/bin/env python
# encoding: utf-8
'''
Created on Oct 15, 2014
@author: tmahrt
'''
import codecs
from distutils.core import setup
setup(name='pysle',
version='1.3.0',
version='1.4.0',
author='Tim Mahrt',
author_email='timmahrt@gmail.com',
package_dir={'pysle':'pysle'},
packages=['pysle'],
license='LICENSE',
long_description=open('README.rst', 'r').read(),
long_description=codecs.open('README.rst', 'r', encoding="utf-8").read(),
# install_requires=[], # No requirements! # requires 'from setuptools import setup'
)
@@ -1,3 +1,4 @@
#encoding: utf-8
'''
Created on Oct 22, 2014
@@ -12,16 +13,18 @@ from pysle import pronunciationtools
# In this first example we look up the syllabification of a word and get its
# stress information.
searchWord = 'pumpkins'
isleDict = isletool.LexicalTool('islev2.txt')
searchWord = 'catatonic'
isleDict = isletool.LexicalTool('ISLEdict.txt')
lookupResults = isleDict.lookup(searchWord)
firstEntry = lookupResults[0]
firstSyllableList = firstEntry[0]
firstSyllableList = ".".join([u" ".join(syllable) for syllable in firstSyllableList])
firstStressList = firstEntry[1]
print(searchWord)
print(firstSyllableList, firstStressList) # 3rd syllable carries stress
print(firstSyllableList)
print(firstStressList) # 3rd syllable carries stress
# Here we determine the syllabification of a word, as it was said.
@@ -35,10 +38,14 @@ returnList = pronunciationtools.findBestSyllabification(isleDict,
searchWord,
anotherPhoneList)
stressedSyllable, syllableList, syllabification, stressedIndex = returnList
(stressedSyllable, syllableList, syllabification,
stressedSyllableIndexList, stressedPhoneIndexList,
flattenedStressIndexList) = returnList
print(searchWord)
print(anotherPhoneList)
print(syllableList) # We can see the first syllable was elided
print(stressedSyllableIndexList) # We can see the first syllable was elided
print(stressedPhoneIndexList)
print(flattenedStressIndexList)
print(syllableList)
print(syllabification)
+55
@@ -0,0 +1,55 @@
#encoding: utf-8
'''
Created on July 08, 2016
@author: tmahrt
Basic examples of common usage.
'''
import random
from pysle import isletool
tmpPath = r"C:\Users\Tim\Dropbox\workspace\pysle\test\ISLEdict.txt"
isleDict = isletool.LexicalTool(tmpPath)
def printOutMatches(matchStr, numSyllables=None, wordInitial='ok',
wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
multiword='ok', numMatches=None, matchList=None):
if matchList is None:
matchList = isleDict.search(matchStr, numSyllables, wordInitial,
wordFinal, spanSyllable, stressedSyllable,
multiword)
else:
matchList = isletool.search(matchList, matchStr, numSyllables, wordInitial,
wordFinal, spanSyllable, stressedSyllable,
multiword)
if numMatches is not None and len(matchList) > numMatches:
random.shuffle(matchList)
for i, matchTuple in enumerate(matchList):
if numMatches is not None and i >= numMatches:
break
word, pronList = matchTuple
print("%s: %s" % (word, ",".join(pronList)))
print("")
return matchList
# 2-syllable words with a stressed syllable containing 'dV' but not word initially
printOutMatches("dV", stressedSyllable="only", spanSyllable="no",
wordInitial="no", numSyllables=2, numMatches=10)
# 3-syllable word with an 'ld' sequence that spans a syllable boundary
printOutMatches("lBd", wordInitial="no", multiword='no',
numSyllables=3, numMatches=10)
# words ending in 'inth'
matchList = printOutMatches(u"ɪ", wordFinal="only", numMatches=10)
# that also start with 's'
matchList = printOutMatches("s", wordInitial="only", numMatches=10,
matchList=matchList, multiword="no")