insert takes 2 arguments

In Python 3.4 and 2.7 I get the error "insert takes 2 arguments". This PR fixes it, though I had to assume the call was meant to be an append.
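
For reference, the error comes from calling `list.insert` with a single argument. A minimal sketch, assuming plain Python lists like those used in `_parsePronunciation`; the variable name `stressed` and the index values are hypothetical and chosen only for illustration:

```python
# list.insert needs an index and a value; list.append takes only the value.
stressed = []

try:
    stressed.insert(2)  # one argument only: raises TypeError
except TypeError as err:
    error_message = str(err)

stressed.insert(0, 2)  # e.g. a primary-stress index placed at the front
stressed.append(0)     # e.g. a secondary-stress index placed at the end

print(error_message)
print(stressed)
```

Whether secondary stress was really meant to be appended is the assumption this PR makes; the sketch only shows why the one-argument call fails.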
BUGFIX: Dental and Stop special keys don't match multichar sounds like tʃ
8 changed files with 340 additions and 27 deletions
@@ -31,7 +31,7 @@ What can you do with this library?
 - map an actual pronunciation to a dictionary pronunciation (can be used 
  to automatically find speech errors)::
-    pysle.pronunciationtools.findClosestPronunciation(isleDict, 'cat', ['kh', 'ae',]) 
+    pysle.pronunciationtools.findClosestPronunciation(isleDict, 'cat', ['k', 'æ',])
 - automatically syllabify a praat textgrid containing words and phones 
  (e.g. force-aligned text) -- requires my 
@@ -39,10 +39,23 @@ What can you do with this library?
    pysle.syllabifyTextgrid(isleDict, praatioTextgrid, "words", "phones")
 - search for words based on pronunciation::
    e.g. Words that start with a sound, or have a sound word medially, or 
    in stressed vowel position, etc.
    see /tests/dictionary_search.py
 Major revisions
 ================
 Ver 1.4 (July 9, 2016)
 - added search functionality
 - ported code to use the new unicode IPA-based isledict
  (the old one was ascii)
 Ver 1.3 (March 15, 2016)
 - added indices for stressed vowels
@@ -64,12 +77,14 @@ Requirements
 ================
 - Before you use this library (before or after installing it) you will need
-  to download the ISLEX dictionary.  It can be downloaded here:
+  to download the ISLEX dictionary.  It can be downloaded here under the
  section 'English' linked under the text 'English Pronlex'
  (with a file name of ISLEdict.txt):
-  `ISLEX project page <http://www.isle.illinois.edu/sst/data/dict/>`_
+  `ISLEX project page <http://isle.illinois.edu/sst/data/g2ps/>`_
  `Direct link to the ISLEX file used in this project
-  <http://www.isle.illinois.edu/sst/data/dict/islex/islev2.txt>`_ (islev2.txt)
+  <http://isle.illinois.edu/sst/data/g2ps/English/ISLEdict.txt>`_ (ISLEdict.txt)
 - ``Python 2.7.*`` or above
@@ -103,7 +118,7 @@ Here is a typical common usage::
    from pysle import isle
    isleDict = isle.LexicalTool('C:\islev2.dict')
    print isleDict.lookup('catatonic')[0] # Get the first pronunciation
-    >> [['kh', '@,'], ['t_(', '&'], ['th', "A'"], ['n', 'I', 'kh']] [2]
+    >> [['k', 'ˌæ'], ['t˺', 'ə'], ['t', 'ˈɑ'], ['n', 'ɪ', 'k']] [2, 0]
 and another::
@@ -111,7 +126,7 @@ and another::
    from pysle import pronunciationTools
    searchWord = 'another'
-    anotherPhoneList = ['n', '@', 'th', 'r'] # Actually produced
+    anotherPhoneList = ['n', '@', 'th', 'r'] # Actually produced (ASCII or IPA ok here)
    returnList = pronunciationTools.findBestSyllabification(isleDict, 
                                                            searchWord, 
@@ -128,7 +143,7 @@ Citing pysle
 Pysle is general purpose coding and doesn't need to be cited
 (you should cite the
-`ISLEX project <http://www.isle.illinois.edu/sst/data/dict/islex/index.shtml>`_
+`ISLEX project <http://isle.illinois.edu/sst/data/g2ps/>`_
 instead) but if you would like to, it can be cited like so:
 Tim Mahrt. Pysle. https://github.com/timmahrt/pysle, 2016.
@@ -139,7 +154,7 @@ Acknowledgements
 Development of Pysle was possible thanks to NSF grant **IIS 07-03624**
 to Jennifer Cole and Mark Hasegawa-Johnson, NSF grant **BCS 12-51343**
-to Jennifer Cole, José Hualde, and Caroline Smith, and
+to Jennifer Cole, José Hualde, and Caroline Smith, and
-to the A*MIDEX project (n° **ANR-11-IDEX-0001-02**) to James Sneed German
+to the A*MIDEX project (n° **ANR-11-IDEX-0001-02**) to James Sneed German
-funded by the Investissements d’Avenir French Government program, managed
+funded by the Investissements d'Avenir French Government program, managed
 by the French National Research Agency (ANR).
@@ -1,11 +1,33 @@
 #encoding: utf-8
 '''
 Created on Oct 11, 2012
@author: timmahrt
 '''
 import io
 import re
-vowelList = ['a', '@', 'e', 'i', 'o', 'u', '^', '&', '>', ]
+
 charList = [u'#', u'.', u'aʊ', u'b', u'd', u'dʒ', u'ei', u'f', u'g',
            u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'oʊ', u'p',
            u'r', u's', u't', u'tʃ', u'u', u'v', u'w', u'z', u'æ',
            u'ð', u'ŋ', u'ɑ', u'ɑɪ', u'ɔ', u'ɔi', u'ə', u'ɚ', u'ɛ', u'ɝ',
            u'ɪ', u'ɵ', u'ɹ', u'ʃ', u'ʊ', u'ʒ', u'æ', u'ʌ', ]
 diacriticList = [u'˺', u'ˌ', u'̩', u'̃', ]
 vowelList = [u'aʊ', u'ei', u'i', u'oʊ', u'u', u'æ',
             u'ɑ', u'ɑɪ', u'ɔ', u'ɔi', u'ə', u'ɚ', u'ɛ', u'ɝ',
             u'ɪ', u'ʊ', u'ʌ', ]
 def isVowel(char):
    return any([vowel in char for vowel in vowelList])
 def sequenceMatch(matchChar, searchStr):
    return matchChar in searchStr
 class WordNotInISLE(Exception):
@@ -30,7 +52,8 @@ class LexicalTool():
        Builds the isle textfile into a dictionary for fast searching
        '''
        lexDict = {}
-        wordList = [line.rstrip('\n') for line in open(self.islePath, "rU")]
+        with io.open(self.islePath, "r", encoding='utf-8') as fd:
            wordList = [line.rstrip('\n') for line in fd]
        for row in wordList:
            word, pronunciation = row.split(" ", 1)
@@ -60,6 +83,214 @@ class LexicalTool():
        return pronList
    def search(self, matchStr, numSyllables=None, wordInitial='ok',
               wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
               multiword='ok'):
        return search(self.data.items(), matchStr, numSyllables=numSyllables,
                      wordInitial=wordInitial, wordFinal=wordFinal,
                      spanSyllable=spanSyllable,
                      stressedSyllable=stressedSyllable,
                      multiword=multiword)
 def _prepRESearchStr(matchStr, wordInitial='ok', wordFinal='ok',
                     spanSyllable='ok', stressedSyllable='ok'):
    '''
    Prepares a user's RE string for a search
    '''
    # Protect sounds that are two characters
    # After this we can assume that each character represents a sound
    # (We'll revert back when we're done processing the RE)
    replList = [(u'ei', u'9'), (u'tʃ', u'='), (u'oʊ', u'~'),
                (u'dʒ', u'@'), (u'aʊ', u'%'), (u'ɑɪ', u'&'),
                (u'ɔi', u'$')]
    # Add to the replList
    currentReplNum = 0
    startI = 0
    for left, right in (('(', ')'), ('[', ']')):
        while True:
            try:
                i = matchStr.index(left, startI)
            except ValueError:
                break
            j = matchStr.index(right, i) + 1
            replList.append((matchStr[i:j], str(currentReplNum)))
            currentReplNum += 1
            startI = j
    for charA, charB in replList:
        matchStr = matchStr.replace(charA, charB)
    # Characters to check between all other characters
    # Don't check between all other characters if the character is already
    # in the search string or
    interleaveStr = None
    stressOpt = (stressedSyllable == 'ok' or stressedSyllable == 'only')
    spanOpt = (spanSyllable == 'ok' or spanSyllable == 'only')
    if stressOpt and spanOpt:
        interleaveStr = u"\.?ˈ?"
    elif stressOpt:
        interleaveStr = u"ˈ?"
    elif spanOpt:
        interleaveStr = u"\.?"
    if interleaveStr is not None:
        matchStr = interleaveStr.join(matchStr)
    # Setting search boundaries
    # We search on '[^\.#]' and not '.' so that the search doesn't span
    # multiple syllables or words
    if wordInitial == 'only':
        matchStr = u'#' + matchStr
    elif wordInitial == 'no':
        # Match the closest preceding syllable.  If there is none, look
        # for word boundary plus at least one other character
        matchStr = u'(?:\.[^\.#]*?|#[^\.#]+?)' + matchStr
    else:
        matchStr = u'[#\.][^\.#]*?' + matchStr
    if wordFinal == 'only':
        matchStr = matchStr + u'#'
    elif wordFinal == 'no':
        matchStr = matchStr + u"(?:[^\.#]*?\.|[^\.#]+?#)"
    else:
        matchStr = matchStr + u'[^\.#]*?[#\.]'
    # For sounds that are designated two characters, prevent
    # detecting those sounds if the user wanted a sound
    # designated by one of the contained characters
    # Forward search ('a' and not 'ab')
    insertList = []
    for charA, charB in [(u'e', u'i'), (u't', u'ʃ'), (u'd', u'ʒ'),
                         (u'o', u'ʊ'), (u'a', u'ʊ|ɪ'), (u'ɔ', u'i'), ]:
        startI = 0
        while True:
            try:
                i = matchStr.index(charA, startI)
            except ValueError:
                break
            if matchStr[i + 1] != charB:
                forwardStr = u'(?!%s)' % charB
 #                 matchStr = matchStr[:i + 1] + forwardStr + matchStr[i + 1:]
                startI = i + 1 + len(forwardStr)
                insertList.append((i + 1, forwardStr))
    # Backward search ('b' and not 'ab')
    for charA, charB in [(u't', u'ʃ'), (u'd', u'ʒ'),
                         (u'a|o', u'ʊ'), (u'e|ɔ', u'i'), (u'ɑ', u'ɪ'), ]:
        startI = 0
        while True:
            try:
                i = matchStr.index(charB, startI)
            except ValueError:
                break
            if matchStr[i - 1] != charA:
                backStr = u'(?<!%s)' % charA
 #                 matchStr = matchStr[:i] + backStr + matchStr[i:]
                startI = i + 1 + len(backStr)
                insertList.append((i, backStr))
    insertList.sort()
    for i, insertStr in insertList[::-1]:
        matchStr = matchStr[:i] + insertStr + matchStr[i:]
    # Revert the special sounds back from 1 character to 2 characters
    for charA, charB in replList:
        matchStr = matchStr.replace(charB, charA)
    # Replace special characters
    replDict = {"D": u"(?:t(?!ʃ)|d(?!ʒ)|[sz])",  # dentals
                "F": u"[ʃʒfvszɵðh]",  # fricatives
                "S": u"(?:t(?!ʃ)|d(?!ʒ)|[pbkg])",  # stops
                "N": u"[nmŋ]",  # nasals
                "R": u"[rɝɚ]",  # rhotics
                "V": u"(?:aʊ|ei|oʊ|ɑɪ|ɔi|[iuæɑɔəɛɪʊʌ]):?",  # vowels
                "B": u"\.",  # syllable boundary
                }
    for char, replStr in replDict.items():
        matchStr = matchStr.replace(char, replStr)
    return matchStr
 def search(searchList, matchStr, numSyllables=None, wordInitial='ok',
           wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
           multiword='ok'):
    '''
    Searches for matching words in the dictionary with regular expressions
    wordInitial, wordFinal, spanSyllable, stressedSyllable, and multiword
    can take three different values: 'ok', 'only', or 'no'.
    Special search characters:
    'D' - any dental; 'F' - any fricative; 'S' - any stop
    'V' - any vowel; 'N' - any nasal; 'R' - any rhotic
    '#' - word boundary
    'B' - syllable boundary
    '.' - anything
    For advanced queries:
    Regular expression syntax applies, so if you wanted to search for any
    word ending with a vowel or rhotic, matchStr = '(?:V|R)#', '[VR]#', etc.
    '''
    # Run search for words
    matchStr = _prepRESearchStr(matchStr, wordInitial, wordFinal,
                                spanSyllable, stressedSyllable)
    compiledRE = re.compile(matchStr)
    retList = []
    for word, pronList in searchList:
        newPronList = []
        for pron in pronList:
            searchPron = pron.replace(",", "").replace(" ", "")
            # Ignore diacritics for now:
            for diacritic in diacriticList:
                if diacritic not in matchStr:
                    searchPron = searchPron.replace(diacritic, "")
            if numSyllables is not None:
                if numSyllables != searchPron.count('.') + 1:
                    continue
            # Is this a compound word?
            if multiword == 'only':
                if searchPron.count('#') == 2:
                    continue
            elif multiword == 'no':
                if searchPron.count('#') > 2:
                    continue
            matchList = compiledRE.findall(searchPron)
            if len(matchList) > 0:
                if stressedSyllable == 'only':
                    if all([u"ˈ" not in match for match in matchList]):
                        continue
                if stressedSyllable == 'no':
                    if all([u"ˈ" in match for match in matchList]):
                        continue
                # For syllable spanning, we check if there is a syllable
                # marker inside (not at the border) of the match.
                if spanSyllable == 'only':
                    if all(["." not in txt[1:-1] for txt in matchList]):
                        continue
                if spanSyllable == 'no':
                    if all(["." in txt[1:-1] for txt in matchList]):
                        continue
                newPronList.append(pron)
        if len(newPronList) > 0:
            retList.append((word, newPronList))
    retList.sort()
    return retList
 def _parsePronunciation(pronunciationStr):
    '''
@@ -76,13 +307,13 @@ def _parsePronunciation(pronunciationStr):
    stressedPhoneList = []
    for i, syllable in enumerate(syllableList):
        for j, phone in enumerate(syllable):
-            if "'" in phone:
+            if u"ˈ" in phone:
                stressedSyllableList.insert(0, i)
                stressedPhoneList.insert(0, j)
                break
-            elif '"' in phone:
+            elif u'ˌ' in phone:
-                stressedSyllableList.insert(i)
+                stressedSyllableList.append(i)
-                stressedPhoneList.insert(j)
+                stressedPhoneList.append(j)
    return syllableList, stressedSyllableList, stressedPhoneList
@@ -1,3 +1,4 @@
 #encoding: utf-8
 '''
 Created on Oct 22, 2014
@@ -76,7 +77,7 @@ def syllabifyTextgrid(isleDict, tg, wordTierName, phoneTierName,
            stressJ = None  #
        if stressI is not None:
-            syllableList[stressI][stressJ] += "'"
+            syllableList[stressI][stressJ] += u"ˈ"
        i = 0
 #         print(syllableList)
@@ -1,3 +1,4 @@
 #encoding: utf-8
 '''
 Created on Oct 15, 2014
@@ -151,7 +152,7 @@ def _findBestPronunciation(isleDict, wordText, aPron):
        hasStress = False
        for syllable in syllableList:
            for phone in syllable:
-                hasStress = "'" in phone or hasStress
+                hasStress = u"ˈ" in phone or hasStress
        if hasStress:
            withStress.append(i)
@@ -1,16 +1,19 @@
 #!/usr/bin/env python
 # encoding: utf-8
 '''
 Created on Oct 15, 2014
@author: tmahrt
 '''
 import codecs
 from distutils.core import setup
 setup(name='pysle',
-      version='1.3.0',
+      version='1.4.0',
      author='Tim Mahrt',
      author_email='timmahrt@gmail.com',
      package_dir={'pysle':'pysle'},
      packages=['pysle'],
      license='LICENSE',
-      long_description=open('README.rst', 'r').read(),
+      long_description=codecs.open('README.rst', 'r', encoding="utf-8").read(),
 #       install_requires=[], # No requirements! # requires 'from setuptools import setup'
      )
@@ -1,3 +1,4 @@
 #encoding: utf-8
 '''
 Created on Oct 22, 2014
@@ -12,16 +13,18 @@ from pysle import pronunciationtools
 # In this first example we look up the syllabification of a word and get its 
 # stress information.
-searchWord = 'pumpkins'
+searchWord = 'catatonic'
-isleDict = isletool.LexicalTool('islev2.txt')
+isleDict = isletool.LexicalTool('ISLEdict.txt')
 lookupResults = isleDict.lookup(searchWord)
 firstEntry = lookupResults[0]
 firstSyllableList = firstEntry[0] 
 firstSyllableList = ".".join([u" ".join(syllable) for syllable in firstSyllableList])
 firstStressList = firstEntry[1]
 print(searchWord)
-print(firstSyllableList, firstStressList) # 3rd syllable carries stress
+print(firstSyllableList)
 print(firstStressList) # 3rd syllable carries stress
 # Here we determine the syllabification of a word, as it was said.
@@ -35,10 +38,14 @@ returnList = pronunciationtools.findBestSyllabification(isleDict,
                                                        searchWord, 
                                                        anotherPhoneList)
-stressedSyllable, syllableList, syllabification, stressedIndex = returnList
+(stressedSyllable, syllableList, syllabification,
-
+stressedSyllableIndexList, stressedPhoneIndexList,
 flattenedStressIndexList) = returnList
 print(searchWord)
 print(anotherPhoneList)
-print(syllableList) # We can see the first syllable was elided
+print(stressedSyllableIndexList) # We can see the first syllable was elided
-
+print(stressedPhoneIndexList)
 print(flattenedStressIndexList)
 print(syllableList)
 print(syllabification)
@@ -0,0 +1,55 @@
 #encoding: utf-8
 '''
 Created on July 08, 2016
@author: tmahrt
 Basic examples of common usage.
 '''
 import random
 from pysle import isletool
 tmpPath = r"C:\Users\Tim\Dropbox\workspace\pysle\test\ISLEdict.txt"
 isleDict = isletool.LexicalTool(tmpPath)
 def printOutMatches(matchStr, numSyllables=None, wordInitial='ok',
                    wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
                    multiword='ok', numMatches=None, matchList=None):
    if matchList is None:
        matchList = isleDict.search(matchStr, numSyllables, wordInitial,
                                    wordFinal, spanSyllable, stressedSyllable,
                                    multiword)
    else:
        matchList = isletool.search(matchList, matchStr, numSyllables, wordInitial,
                                    wordFinal, spanSyllable, stressedSyllable,
                                    multiword)
    if numMatches is not None and len(matchList) > numMatches:
        random.shuffle(matchList)
    for i, matchTuple in enumerate(matchList):
        if numMatches is not None and i >= numMatches:
            break
        word, pronList = matchTuple
        print("%s: %s" % (word, ",".join(pronList)))
    print("")
    return matchList
 # 2-syllable words with a stressed syllable containing 'dV' but not word initially
 printOutMatches("dV", stressedSyllable="only", spanSyllable="no",
                wordInitial="no", numSyllables=2, numMatches=10)
 # 3-syllable word with an 'ld' sequence that spans a syllable boundary
 printOutMatches("lBd", wordInitial="no", multiword='no',
                numSyllables=3, numMatches=10)
 # words ending in 'inth'
 matchList = printOutMatches(u"ɪnɵ", wordFinal="only", numMatches=10)
 # that also start with 's'
 matchList = printOutMatches("s", wordInitial="only", numMatches=10,
                            matchList=matchList, multiword="no")
Author	SHA1	Message	Date
M Clark	9e212125b1	insert takes 2 arguments In python 3.4 and 2.7 I get an error "insert takes 2 arguments". This PR fixes it, but I had to assume it was intended to be an append.	2016-10-20 08:27:40 +08:00
Tim Mahrt	1b1903bc0b	BUGFIX: Dental and Stop special keys don't match multichar sounds like tʃ So 't' will be matched but not 'tʃ'	2016-07-20 16:35:44 +02:00
Tim Mahrt	5e64deebe6	BUGFIX: Removed diacritics from strings while searching Unless the user is explicitly searching for the diacritic. Also, added some more documentation.	2016-07-18 17:10:22 +02:00
Tim Mahrt	bce3c8ff23	BUGFIX: Protect () and [] in searches	2016-07-18 17:08:02 +02:00
Tim Mahrt	4056b105c9	BUGFIX: Monophthong searches no longer match diphthongs This was in the code but the functionality didn't work.	2016-07-18 17:06:50 +02:00
Tim Mahrt	d88ff7d8d9	FEATURE: Updated to new isledict format. Now using unicode IPA It made the code a little more complex and now the system is less typing friendly but is more intuitive (no more guessing how to pronounce a character). Update includes changes to documentation.	2016-07-16 00:49:45 +02:00
Tim Mahrt	4cc4bf85ec	DOCUMENTATION: Formatting fix	2016-07-09 17:26:47 +02:00
Tim Mahrt	4c1a26ed03	DOCUMENTATION: Ready for release v1.4	2016-07-09 17:23:44 +02:00
Tim Mahrt	ac8643678b	REFACTOR: Names follow pep008	2016-07-09 17:23:18 +02:00
Tim Mahrt	b76454f626	FEATURE: Added searching of words by pronunciation Based on regular expressions with some keyword parameters to simplify search queries. A set of examples is provided.	2016-07-09 17:22:12 +02:00
timmahrt	ea0bc5c5cd	BUGFIX: Unicode error in installation file python 2.x	2016-07-08 00:19:35 +02:00
timmahrt	5d70367bfc	BUGFIX: OS X default encoding with io.open is whack I'm changing everything to utf-8	2016-06-29 23:58:38 +02:00
Tim Mahrt	81257bdfaf	BUGFIX: Brought example file up-to-date with code	2016-06-28 20:27:35 +02:00
Tim Mahrt	88f79d63e8	BUGFIX: 'U' mode deprecated in open in favor of io.open 'U' is universal line support mode, which is the default mode in io.open	2016-06-28 20:23:29 +02:00
Tim Mahrt	2dcb92217d	BUGFIX: ASCII installation files with unicode caused PIP problems	2016-03-22 12:21:21 +01:00
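
The fix described in commit 1b1903bc0b (a plain 't' is matched but 'tʃ' is not) relies on a regular-expression negative lookahead. A minimal sketch of that idea, using the stop pattern that appears in the diff; the sample strings and the names `stop_pattern`, `samples` and `matches` are made up for illustration and are not taken from the ISLE dictionary:

```python
# -*- coding: utf-8 -*-
import re

# Pattern from the "S" (stops) entry in the diff:
# 't' not followed by 'ʃ', 'd' not followed by 'ʒ', or one of p, b, k, g.
stop_pattern = re.compile(u"(?:t(?!ʃ)|d(?!ʒ)|[pbkg])")

# Hypothetical phone strings, chosen only to exercise the lookahead.
samples = [u"tɑp", u"tʃɪp", u"dʒɑb"]

# The affricates 'tʃ' and 'dʒ' are skipped; the plain stops are still found.
matches = [stop_pattern.findall(sample) for sample in samples]
print(matches)
```

The lookahead consumes no characters, so it only vetoes a match at that position and the rest of the search string is matched as usual.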