DOCUMENTATION: Formatting fix

DOCUMENTATION: Ready for release v1.4
REFACTOR: Names follow pep008
8 changed files with 334 additions and 51 deletions
@@ -3,6 +3,9 @@
 pysle
 ---------

+.. image:: https://img.shields.io/badge/license-MIT-blue.svg?
+   :target: http://opensource.org/licenses/MIT
+
 Pronounced like 'p' + 'isle'.

 An interface for the ISLEX (international speech lexicon) dictionary, 
@@ -36,6 +39,36 @@ What can you do with this library?
  
    pysle.syllabifyTextgrid(isleDict, praatioTextgrid, "words", "phones")

+- search for words based on pronunciation::
+
+    e.g. Words that start with a sound, or have a sound word medially, or 
+    in stressed vowel position, etc.
+    
+    see /tests/dictionary_search.py
+    
+Major revisions
+================
+
+Ver 1.4 (July 9, 2016)
+
+- added search functionality
+
+Ver 1.3 (March 15, 2016)
+
+- added indices for stressed vowels
+
+Ver 1.2 (June 20, 2015)
+
+- Python 3.x support
+
+Ver 1.1 (January 30, 2015)
+
+- word lookup ~65 times faster
+
+Ver 1.0 (October 23, 2014)
+
+- first public release.
+

 Requirements
 ================
@@ -50,6 +83,8 @@ Requirements

 - ``Python 2.7.*`` or above

+- ``Python 3.3.*`` or above
+
 - The `praatIO <https://github.com/timmahrt/praatIO>`_ library is required IF 
  you want to use the textgrid functionality.  It is not required 
  for normal use.
@@ -58,10 +93,12 @@ Requirements
 Installation
 ================

-From a command-line shell, navigate to the directory this is located in 
-and type::
+If you are on Windows, you can use the installer found here (check that it is up to date, though):
+`Windows installer <http://www.timmahrt.com/python_installers>`_

-	python setup.py install
+Otherwise, to install manually, download the source from GitHub, then from a command-line shell navigate to the directory containing setup.py and type::
+
+    python setup.py install

 If python is not in your path, you'll need to enter the full path e.g.::

@@ -93,5 +130,26 @@ and another::
    >> [["''"], ['n', '@'], ['th', 'r']]
    

-Please see \\test for example usage
+Please see \\examples for example usage

+
+Citing pysle
+===============
+
+Pysle is general-purpose code and doesn't need to be cited
+(you should cite the
+`ISLEX project <http://www.isle.illinois.edu/sst/data/dict/islex/index.shtml>`_
+instead) but if you would like to, it can be cited like so:
+
+Tim Mahrt. Pysle. https://github.com/timmahrt/pysle, 2016.
+
+
+Acknowledgements
+================
+
+Development of Pysle was possible thanks to NSF grant **IIS 07-03624**
+to Jennifer Cole and Mark Hasegawa-Johnson, NSF grant **BCS 12-51343**
+to Jennifer Cole, José Hualde, and Caroline Smith, and
+to the A*MIDEX project (n° **ANR-11-IDEX-0001-02**) to James Sneed German
+funded by the Investissements d'Avenir French Government program, managed
+by the French National Research Agency (ANR).
@@ -4,10 +4,27 @@ Created on Oct 11, 2012
@author: timmahrt
 '''

+import io
+import re
+
+charList = ['#', '&', '&r', '3r', '9r', '>', '>i', '@', 'A', 'D', 'E',
+            'I', 'N', 'S', 'T', 'U', 'Z', '^', 'a', 'aI', 'aU', 'b',
+            'd', 'dZ', 'd_(', 'e', 'ei', 'f', 'g', 'h', 'i', 'i:',
+            'j', 'k', 'kh', 'l', 'l=', 'm', 'n', 'n=', 'oU', 'p',
+            'ph', 'r', 's', 'sh', 't', 'tS', 't_(', 'th', 'u',
+            'v', 'w', 'y', 'z']

 vowelList = ['a', '@', 'e', 'i', 'o', 'u', '^', '&', '>', ]


+def isVowel(char):
+    return any([vowel in char for vowel in vowelList])
+
+
+def sequenceMatch(matchChar, searchStr):
+    return matchChar in searchStr
+
+
 class WordNotInISLE(Exception):
    
    def __init__(self, word):
@@ -30,7 +47,8 @@ class LexicalTool():
        Builds the isle textfile into a dictionary for fast searching
        '''
        lexDict = {}
-        wordList = [line.rstrip('\n') for line in open(self.islePath, "rU")]
+        with io.open(self.islePath, "r", encoding='utf-8') as fd:
+            wordList = [line.rstrip('\n') for line in fd]
            
        for row in wordList:
            word, pronunciation = row.split(" ", 1)
@@ -60,6 +78,123 @@ class LexicalTool():
        
        return pronList

+    def search(self, matchStr, numSyllables=None, wordInitial='ok',
+               wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
+               multiword='ok'):
+        return search(self.data.items(), matchStr, numSyllables=numSyllables,
+                      wordInitial=wordInitial, wordFinal=wordFinal,
+                      spanSyllable=spanSyllable,
+                      stressedSyllable=stressedSyllable,
+                      multiword=multiword)
+
+
+def search(searchList, matchStr, numSyllables=None, wordInitial='ok',
+           wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
+           multiword='ok'):
+    '''
+    Searches for matching words in the dictionary with regular expressions
+    
+    wordInitial, wordFinal, spanSyllable, stressedSyllable, and multiword
+    can take three different values: 'ok', 'only', or 'no'.
+    
+    Special search characters:
+    'V' - any vowel
+    'R' - any rhotic
+    '#' - word boundary
+    'B' - syllable boundary
+    '.' - anything
+    
+    Regular expression syntax applies, so if you wanted to search for any
+    word ending with a vowel or rhotic, matchStr = '(?:V|R)#'
+    '''
+    
+    # Characters to check between all other characters
+    # Don't check between all other characters if the character is already
+    # in the search string or
+    interleaveStr = None
+    stressOpt = (stressedSyllable == 'ok' or stressedSyllable == 'only')
+    spanOpt = (spanSyllable == 'ok' or spanSyllable == 'only')
+    if stressOpt and spanOpt:
+        interleaveStr = "\.?'?"
+    elif stressOpt:
+        interleaveStr = "'?"
+    elif spanOpt:
+        interleaveStr = "\.?"
+    
+    if interleaveStr is not None:
+        matchStr = interleaveStr.join(matchStr)
+    
+    # Setting search boundaries
+    # We search on '[^\.#]' and not '.' so that the search doesn't span
+    # multiple syllables or words
+    if wordInitial == 'only':
+        matchStr = '#' + matchStr
+    elif wordInitial == 'no':
+        # Match the closest preceding syllable.  If there is none, look
+        # for word boundary plus at least one other character
+        matchStr = '(?:\.[^\.#]*?|#[^\.#]+?)' + matchStr
+    else:
+        matchStr = '[#\.][^\.#]*?' + matchStr
+    
+    if wordFinal == 'only':
+        matchStr = matchStr + '#'
+    elif wordFinal == 'no':
+        matchStr = matchStr + "(?:[^\.#]*?\.|[^\.#]+?#)"
+    else:
+        matchStr = matchStr + '[^\.#]*?[#\.]'
+    
+    # Replace special characters
+    replDict = {"V": "(?:aI|aU|ei|oU|[AEIaeiu]):?",
+                "R": "[&39]?r",
+                "B": "\."}
+    
+    for char, replStr in replDict.items():
+        matchStr = matchStr.replace(char, replStr)
+    
+    # Run search for words
+    compiledRE = re.compile(matchStr)
+    retList = []
+    for word, pronList in searchList:
+        newPronList = []
+        for pron in pronList:
+            searchPron = pron.replace(",", "").replace(" ", "")
+            if numSyllables is not None:
+                if numSyllables != searchPron.count('.') + 1:
+                    continue
+            
+            # Is this a compound word?
+            if multiword == 'only':
+                if searchPron.count('#') == 2:
+                    continue
+            elif multiword == 'no':
+                if searchPron.count('#') > 2:
+                    continue
+            
+            matchList = compiledRE.findall(searchPron)
+            if len(matchList) > 0:
+                if stressedSyllable == 'only':
+                    if all(["'" not in match for match in matchList]):
+                        continue
+                if stressedSyllable == 'no':
+                    if all(["'" in match for match in matchList]):
+                        continue
+                
+                # For syllable spanning, we check if there is a syllable
+                # marker inside (not at the border) of the match.
+                if spanSyllable == 'only':
+                    if all(["." not in txt[1:-1] for txt in matchList]):
+                        continue
+                if spanSyllable == 'no':
+                    if all(["." in txt[1:-1] for txt in matchList]):
+                        continue
+                newPronList.append(pron)
+        
+        if len(newPronList) > 0:
+            retList.append((word, newPronList))
+    
+    retList.sort()
+    return retList
+

 def _parsePronunciation(pronunciationStr):
    '''
@@ -69,21 +204,22 @@ def _parsePronunciation(pronunciationStr):
    secondary stress locations
    '''
    syllableTxt = pronunciationStr.split("#")[1].strip()
-    syllableList = [x for x in syllableTxt.split(' . ')]
+    syllableList = [x.split() for x in syllableTxt.split(' . ')]
    
    # Find stress
-    stressList = []
+    stressedSyllableList = []
+    stressedPhoneList = []
    for i, syllable in enumerate(syllableList):
-        # Primary stress
-        if "'" in syllable:
-            stressList.insert(0, i)
-        # Secondary stress
-        elif '"' in syllable:
-            stressList.append(i)
+        for j, phone in enumerate(syllable):
+            if "'" in phone:
+                stressedSyllableList.insert(0, i)
+                stressedPhoneList.insert(0, j)
+                break
+            elif '"' in phone:
+                stressedSyllableList.append(i)
+                stressedPhoneList.append(j)
    
-    syllableList = [x.split(" ") for x in syllableList]
-    
-    return syllableList, stressList
+    return syllableList, stressedSyllableList, stressedPhoneList
            
            
 def getNumPhones(isleDict, label, maxFlag):
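The stress-index bookkeeping in the `_parsePronunciation` hunk can be tried in isolation. The following is a minimal standalone sketch, not the library's code: the function name `parse_pronunciation` and the sample pronunciation strings are hypothetical, and the sketch assumes that a primary stress mark (`'`) or secondary stress mark (`"`) is attached to a phone token. Primary stress is placed at the front of both lists, while secondary-stress entries are appended.

```python
def parse_pronunciation(pronunciation_str):
    """Split a '# ... #' pronunciation into syllables and stress indices.

    Hypothetical standalone sketch of the logic shown in the hunk above.
    """
    # Keep only the text between the word-boundary markers.
    syllable_txt = pronunciation_str.split("#")[1].strip()
    # Syllables are separated by ' . '; phones by whitespace.
    syllable_list = [syl.split() for syl in syllable_txt.split(" . ")]

    stressed_syllables = []
    stressed_phones = []
    for i, syllable in enumerate(syllable_list):
        for j, phone in enumerate(syllable):
            if "'" in phone:
                # Primary stress goes to the front of both lists.
                stressed_syllables.insert(0, i)
                stressed_phones.insert(0, j)
                break
            elif '"' in phone:
                # Secondary stress is appended after any primary stress.
                stressed_syllables.append(i)
                stressed_phones.append(j)

    return syllable_list, stressed_syllables, stressed_phones


# Hypothetical input: primary stress on the second phone of syllable 0.
syllables, stressed_syl, stressed_ph = parse_pronunciation("# p '^ m p . k I n z #")
print(syllables, stressed_syl, stressed_ph)
```

Each entry in the second list pairs with the entry at the same position in the third list, so index 0 of both always describes the primary stress when one exists.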
@@ -11,7 +11,7 @@ class OptionalFeatureError(ImportError):
        return "ERROR: You must have praatio installed to use pysle.praatTools"

 try:
-    import praatio
+    from praatio import tgio
 except ImportError:
    raise OptionalFeatureError()

@@ -39,7 +39,8 @@ def syllabifyTextgrid(isleDict, tg, wordTierName, phoneTierName,
        skipLabelList = []
    
    syllableEntryList = []
-    tonicEntryList = []
+    tonicSEntryList = []
+    tonicPEntryList = []
    for start, stop, word in wordTier.entryList:
        
        if word in skipLabelList:
@@ -63,8 +64,20 @@ def syllabifyTextgrid(isleDict, tg, wordTierName, phoneTierName,
            continue
        
        syllableList = returnList[1]
-        stressIndexList = returnList[3]
-        
+        stressedSyllableIndexList = returnList[3]
+        stressedPhoneIndexList = returnList[4]
+        flattenedPhoneIndexList = returnList[5]
+
+        try:
+            stressI = stressedSyllableIndexList[0]
+            stressJ = stressedPhoneIndexList[0]
+        except IndexError:
+            stressI = None  # Function word probably
+            stressJ = None  #
+            
+        if stressI is not None:
+            syllableList[stressI][stressJ] += "'"
+
        i = 0
 #         print(syllableList)
        for k, syllable in enumerate(syllableList):
@@ -84,24 +97,28 @@ def syllabifyTextgrid(isleDict, tg, wordTierName, phoneTierName,
        
            syllableEntryList.append((syllableStart, syllableEnd, label))
            
-            # Create the tonic tier entry
-            try:
-                stressIndex = stressIndexList[0]
-            except IndexError:
-                stressIndex = None  # Function word probably
-                
-            tonicLabel = ''
-            if k == stressIndex:
-                tonicLabel = 'T'
-                
-            tonicEntryList.append((syllableStart, syllableEnd, tonicLabel))
+            # Create the tonic syllable tier entry
+            if k == stressI:
+                tonicSEntryList.append((syllableStart, syllableEnd, 'T'))
+            
+            # Create the tonic phone tier entry
+            if k == stressI:
+                syllablePhoneTier = phoneTier.crop(syllableStart, syllableEnd,
+                                                   True, False)[0]
+            
+                phoneList = [entry for entry in syllablePhoneTier.entryList
+                             if entry[2] != '']
+                phoneStart, phoneEnd = phoneList[stressJ][:2]
+                tonicPEntryList.append((phoneStart, phoneEnd, 'T'))
    
    # Create a textgrid with the two syllable-level tiers
-    syllableTier = praatio.IntervalTier("syllable", syllableEntryList)
-    tonicTier = praatio.IntervalTier('tonic', tonicEntryList)
+    syllableTier = tgio.IntervalTier("syllable", syllableEntryList)
+    tonicSTier = tgio.IntervalTier('tonicSyllable', tonicSEntryList)
+    tonicPTier = tgio.IntervalTier('tonicVowel', tonicPEntryList)
    
-    syllableTG = praatio.Textgrid()
+    syllableTG = tgio.Textgrid()
    syllableTG.addTier(syllableTier)
-    syllableTG.addTier(tonicTier)
+    syllableTG.addTier(tonicSTier)
+    syllableTG.addTier(tonicPTier)

    return syllableTG
@@ -243,16 +243,16 @@ def alignPronunciations(pronI, pronA):
    
    # Fill in any blanks such that the sequential items have the same
    # index and the two strings are the same length
-    for x in xrange(len(sequenceIndexListA)):
+    for x in range(len(sequenceIndexListA)):
        indexA = sequenceIndexListA[x]
        indexI = sequenceIndexListI[x]
        if indexA < indexI:
-            for x in xrange(indexI - indexA):
+            for x in range(indexI - indexA):
                pronA.insert(indexA, "''")
            sequenceIndexListA = [val + indexI - indexA
                                  for val in sequenceIndexListA]
        elif indexA > indexI:
-            for x in xrange(indexA - indexI):
+            for x in range(indexA - indexI):
                pronI.insert(indexI, "''")
            sequenceIndexListI = [val + indexA - indexI
                                  for val in sequenceIndexListI]
@@ -275,13 +275,25 @@ def findBestSyllabification(isleDict, wordText, actualPronunciationList):
    alignedPhoneList = alignedAPronList[bestIndex]
    alignedSyllables = alignedSyllableList[bestIndex]
    syllabification = isleWordList[bestIndex][0]
-    stressedIndex = isleWordList[bestIndex][1]
+    stressedSyllableIndexList = isleWordList[bestIndex][1]
+    stressedPhoneIndexList = isleWordList[bestIndex][2]
    
    stressedSyllable, syllableList = _syllabifyPhones(alignedPhoneList,
                                                      alignedSyllables,
-                                                      stressedIndex)
+                                                      stressedSyllableIndexList)
    
-    return stressedSyllable, syllableList, syllabification, stressedIndex
+    # Count the index of the stressed phones, if the stress list has
+    # become flattened (no syllable information)
+    flattenedStressIndexList = []
+    for i, j in zip(stressedSyllableIndexList, stressedPhoneIndexList):
+        k = j
+        for l in range(i):
+            k += len(syllableList[l])
+        flattenedStressIndexList.append(k)
+    
+    return (stressedSyllable, syllableList, syllabification,
+            stressedSyllableIndexList, stressedPhoneIndexList,
+            flattenedStressIndexList)


 def findClosestPronunciation(isleDict, wordText, aPron):
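The flattened stress index added to `findBestSyllabification` converts a (syllable index, phone-in-syllable index) pair into a single position counted over all phones of the word. A minimal standalone sketch of that arithmetic follows; the function name `flatten_stress_indices` and the sample data are hypothetical and only illustrate the loop in the hunk above.

```python
def flatten_stress_indices(syllable_list, stressed_syllables, stressed_phones):
    """Convert per-syllable stress positions into word-level phone indices."""
    flattened = []
    for syl_i, phone_j in zip(stressed_syllables, stressed_phones):
        # Count every phone in the syllables that precede the stressed one.
        offset = sum(len(syllable_list[n]) for n in range(syl_i))
        flattened.append(offset + phone_j)
    return flattened


# Hypothetical two-syllable word with four phones per syllable.
syllables = [["p", "^", "m", "p"], ["k", "I", "n", "z"]]
print(flatten_stress_indices(syllables, [0], [1]))
print(flatten_stress_indices(syllables, [1, 0], [1, 0]))
```

Because the offset is computed from the aligned `syllableList`, an elided or inserted phone in an earlier syllable shifts the flattened index accordingly.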
@@ -1,16 +1,19 @@
+#!/usr/bin/env python
+# encoding: utf-8
 '''
 Created on Oct 15, 2014

@author: tmahrt
 '''
+import codecs
 from distutils.core import setup
 setup(name='pysle',
-      version='1.0.0',
+      version='1.4.0',
      author='Tim Mahrt',
      author_email='timmahrt@gmail.com',
      package_dir={'pysle':'pysle'},
      packages=['pysle'],
      license='LICENSE',
-      long_description=open('README.rst', 'r').read(),
+      long_description=codecs.open('README.rst', 'r', encoding="utf-8").read(),
 #       install_requires=[], # No requirements! # requires 'from setuptools import setup'
-      )
+      )
@@ -35,10 +35,12 @@ returnList = pronunciationtools.findBestSyllabification(isleDict,
                                                        searchWord, 
                                                        anotherPhoneList)

-stressedSyllable, syllableList, syllabification, stressedIndex = returnList
-
+(stressedSyllable, syllableList, syllabification,
+stressedSyllableIndexList, stressedPhoneIndexList,
+flattenedStressIndexList) = returnList
 print(searchWord)
 print(anotherPhoneList)
-print(syllableList) # We can see the first syllable was elided
-
+print(stressedSyllableIndexList) # We can see the first syllable was elided
+print(stressedPhoneIndexList)
+print(flattenedStressIndexList)

@@ -0,0 +1,54 @@
+'''
+Created on July 08, 2016
+
+@author: tmahrt
+
+Basic examples of common usage.
+'''
+
+import random
+
+from pysle import isletool
+
+tmpPath = r"C:\Users\Tim\Dropbox\workspace\pysle\test\islev2.txt"
+isleDict = isletool.LexicalTool(tmpPath)
+
+def printOutMatches(matchStr, numSyllables=None, wordInitial='ok',
+                    wordFinal='ok', spanSyllable='ok', stressedSyllable='ok',
+                    multiword='ok', numMatches=None, matchList=None):
+
+    if matchList is None:
+        matchList = isleDict.search(matchStr, numSyllables, wordInitial,
+                                    wordFinal, spanSyllable, stressedSyllable,
+                                    multiword)
+    else:
+        matchList = isletool.search(matchList, matchStr, numSyllables, wordInitial,
+                                    wordFinal, spanSyllable, stressedSyllable,
+                                    multiword)
+    
+    if numMatches is not None and len(matchList) > numMatches:
+        random.shuffle(matchList)
+        
+    for i, matchTuple in enumerate(matchList):
+        if numMatches is not None and i >= numMatches:
+            break
+        word, pronList = matchTuple
+        print("%s: %s" % (word, repr(pronList)))
+    print("")
+    
+    return matchList
+
+# 2-syllable words with a stressed syllable containing 'dV' but not word initially
+printOutMatches("dV", stressedSyllable="only", spanSyllable="no",
+                wordInitial="no", numSyllables=2, numMatches=10)
+ 
+# 3-syllable word with an 'ld' sequence that spans a syllable boundary
+printOutMatches("lBd", wordInitial="no", multiword='no',
+                numSyllables=3, numMatches=10)
+
+# words ending in 'inth'
+matchList = printOutMatches("InT", wordFinal="only", numMatches=10)
+
+# that also start with 's'
+matchList = printOutMatches("s", wordInitial="only", numMatches=10,
+                            matchList=matchList, multiword="no")
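To see why a query such as `"dV"` can match a stressed syllable, it helps to look at how the search string is turned into a regular expression. The sketch below is a simplified, standalone approximation: the replacement table and the interleave string are copied from the `search` function in the diff, but the helper name `build_pattern` and the sample pronunciation strings are hypothetical, and the word-boundary padding and keyword filters are left out.

```python
import re

# Special search characters and their expansions, as listed in the diff.
REPLACEMENTS = {
    "V": "(?:aI|aU|ei|oU|[AEIaeiu]):?",  # any vowel
    "R": "[&39]?r",                      # any rhotic
    "B": r"\.",                          # syllable boundary
}

# Optional syllable boundary and stress mark allowed between query characters
# (the case where both spanSyllable and stressedSyllable permit a match).
INTERLEAVE = r"\.?'?"


def build_pattern(match_str, interleave=INTERLEAVE):
    """Expand a query into a compiled regex (hypothetical helper)."""
    if interleave is not None:
        # str.join on a string places the separator between every character.
        match_str = interleave.join(match_str)
    for char, repl in REPLACEMENTS.items():
        match_str = match_str.replace(char, repl)
    return re.compile(match_str)


# Hypothetical pronunciation with spaces removed, as search() does.
sample = "#d'ei.t@#"
print(bool(build_pattern("dV").search(sample)))
print(bool(build_pattern("dV", interleave=None).search(sample)))
```

Without the interleaved `'?`, the stress mark sitting between `d` and the vowel would block the match, which is the situation the interleave string is there to handle.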
@@ -12,14 +12,14 @@ This snippet shows you how to use this function.

 from os.path import join

-import praatio
+from praatio import tgio
 from pysle import isletool
 from pysle import praattools

 path = join('.', 'files')
 path = "/Users/tmahrt/Dropbox/workspace/pysle/test/files"

-tg = praatio.openTextGrid(join(path, "pumpkins.TextGrid"))
+tg = tgio.openTextGrid(join(path, "pumpkins.TextGrid"))

 # Needs the full path to the file
 islevPath = '/Users/tmahrt/Dropbox/workspace/pysle/test/islev2.txt'
@@ -29,7 +29,8 @@ isleDict = isletool.LexicalTool(islevPath)
 syllableTG = praattools.syllabifyTextgrid(isleDict, tg, "word", "phone",
                                          skipLabelList=["",])
 tg.addTier(syllableTG.tierDict["syllable"])
-tg.addTier(syllableTG.tierDict["tonic"])
+tg.addTier(syllableTG.tierDict["tonicSyllable"])
+tg.addTier(syllableTG.tierDict["tonicVowel"])
Author	SHA1	Message	Date
Tim Mahrt	4cc4bf85ec	DOCUMENTATION: Formatting fix	2016-07-09 17:26:47 +02:00
Tim Mahrt	4c1a26ed03	DOCUMENTATION: Ready for release v1.4	2016-07-09 17:23:44 +02:00
Tim Mahrt	ac8643678b	REFACTOR: Names follow pep008	2016-07-09 17:23:18 +02:00
Tim Mahrt	b76454f626	FEATURE: Added searching of words by pronunciation Based on regular expressions with some keyword parameters to simplify search queries. A set of examples is provided.	2016-07-09 17:22:12 +02:00
timmahrt	ea0bc5c5cd	BUGFIX: Unicode error in installation file python 2.x	2016-07-08 00:19:35 +02:00
timmahrt	5d70367bfc	BUGFIX: OS X default encoding with io.open is whack I'm changing everything to utf-8	2016-06-29 23:58:38 +02:00
Tim Mahrt	81257bdfaf	BUGFIX: Brought example file up-to-date with code	2016-06-28 20:27:35 +02:00
Tim Mahrt	88f79d63e8	BUGFIX: 'U' mode deprecated in open in favor of io.open 'U' is universal line support mode, which is the default mode in io.open	2016-06-28 20:23:29 +02:00
Tim Mahrt	2dcb92217d	BUGFIX: ASCII installation files with unicode caused PIP problems	2016-03-22 12:21:21 +01:00
Tim Mahrt	a36d7c8d17	REFACTOR: Gave the tonic vowel tier a more representative name	2016-03-16 12:00:19 +01:00
Tim Mahrt	65ac652dea	DOCUMENTATION: Version changed to 1.3 in the setup.py file	2016-03-16 11:17:16 +01:00
Tim Mahrt	ee08c347d5	DOCUMENTATION: Bolding text	2016-03-15 17:51:01 +01:00
Tim Mahrt	c16c68a6ac	DOCUMENTATION: Added to the acknowledgements.	2016-03-15 17:48:24 +01:00
Tim Mahrt	bc4f19c74c	FEATURE: Index to stressed vowel; marking of stressed vowels on textgrids - the index to the stressed syllable was provided in the past. Now the library also includes the index to the stressed vowel. This is provided with relation to the phones in the syllable and all phones in the word. - the code that marks the stressed syllables in the textgrids also now marks the stressed vowels - several variables renamed to be more informative	2016-03-15 17:42:33 +01:00
Tim Mahrt	c19cde7165	DOCUMENTATION: The link in the last update didn't work.	2016-02-18 14:21:06 +01:00
Tim Mahrt	38ebc7f3f9	BUGFIX: Python 3.x compatibility Changed xrange -> range Also added some documentation and changed the version number.	2016-02-18 14:17:49 +01:00
Tim Mahrt	102e8a7488	DOCUMENTATION: Removed duplicated text	2016-01-25 13:26:40 +01:00
Tim Mahrt	6b786cd00a	DOCUMENTATION: Bolding	2016-01-25 13:25:33 +01:00
Tim Mahrt	fb1e638cb8	DOCUMENTATION: Fixed link	2016-01-25 13:19:55 +01:00
Tim Mahrt	e5acdfce30	DOCUMENTATION: Corrected islex reference, bolded grant numbers.	2016-01-25 13:18:20 +01:00
Tim Mahrt	d47c312de7	DOCUMENTATION: Added requirements text about Python 3 to readme file.	2016-01-25 13:05:32 +01:00
Tim Mahrt	303d9bfcf2	DOCUMENTATION: Added revision information to pysle and more acknowledgements	2016-01-25 13:02:57 +01:00
Tim Mahrt	9c0ccd5748	DOCUMENTATION: Acknowledgements and citing information added	2016-01-25 12:39:43 +01:00
timmahrt	393182500e	REFACTOR: Synchronized changes with the praatio library Optional textgrid functionality requires praatio 2.1.0 or greater.	2015-07-28 14:30:20 -05:00