Text Parsing in Python

foremanyellowΛογισμικό & κατασκευή λογ/κού

7 Νοε 2013 (πριν από 3 χρόνια και 5 μήνες)

61 εμφανίσεις

Text Parsing in Python



-

Gayatri Nittala



-

Madhubala Vasireddy

Text Parsing


The three W’s!


Efficiency and Perfection


What is Text Parsing?


common programming task


extract or split a sequence of characters

Why is Text Parsing?


Simple file parsing


A tab separated file


Data extraction


Extract specific information from log file


Find and replace


Parsers


syntactic analysis


NLP


Extract information from corpus


POS Tagging


Text Parsing Methods


String Functions


Regular Expressions


Parsers

String Functions


String module in python


Faster, easier to understand and maintain


If you can do, DO IT!


Different built
-
in functions


Find
-
Replace


Split
-
Join


Startswith and Endswith


Is methods

Find and Replace


find, index, rindex, replace


EX: Replace a string in all files in a directory

files = glob.glob(path)

for line in fileinput.input(files,inplace=1):



lineno = 0



lineno = string.find(line, stext)



if lineno >0:



line =line.replace(stext, rtext)



sys.stdout.write(line)

startswith and endswith


Extract quoted words from the given text

myString = "
\
"123
\
"";

if (myString.startswith("
\
""))


print "string with double quotes“


Find if the sentences are interrogative or
exclamative


What an amazing game that was!


Do you like this?



endings = ('!', '?')

sentence.endswith(endings)

isMethods


to check alphabets, numerals, character
case etc


m = 'xxxasdf ‘


m.isalpha()


False

Regular Expressions


concise way for complex patterns


amazingly powerful


wide variety of operations


when you go beyond simple, think about
regular expressions!


Real world problems


Match IP Addresses, email addresses, URLs


Match balanced sets of parenthesis


Substitute words


Tokenize


Validate


Count


Delete duplicates


Natural Language processing


RE in Python


Unleash the power
-

built
-
in re module


Functions


to compile patterns


complie



to perform matches



match, search, findall, finditer



to perform opertaions on match object



group, start, end, span



to substitute



sub, subn


-

Metacharacters

Compiling patterns


re.complile()


pattern for IP Address


^[0
-
9]+
\
.[0
-
9]+
\
.[0
-
9]+
\
.[0
-
9]+$


^
\
d+
\
.
\
d+
\
.
\
d+
\
.
\
d+$


^
\
d{1,3}
\
.
\
d{1,3}
\
.
\
d{1,3}
\
.
\
d{1,3}$


^([01]?
\
d
\
d?|2[0
-
4]
\
d|25[0
-
])
\
.


([01]?
\
d
\
d?|2[0
-
4]
\
d|25[0
-
5])
\
.


([01]?
\
d
\
d?|2[0
-
4]
\
d|25[0
-
5])
\
.


([01]?
\
d
\
d?|2[0
-
4]
\
d|25[0
-
5])$



Compiling patterns


pattern for matching parenthesis


\
(.*
\
)


\
([^)]*
\
)


\
([^()]*
\
)

Substitute


Perform several string substitutions on a given string

import re

def make_xlat(*args, **kwargs):



adict = dict(*args, **kwargs)



rx = re.compile('|'.join(map(re.escape, adict)))



def one_xlate(match):




return adict[match.group(0)]



def xlate(text):




return rx.sub(one_xlate, text)



return xlate

Count


Split and count words in the given text


p = re.compile(r'
\
W+')


len(p.split('This is a test for split().'))

Tokenize


Parsing and Natural Language Processing


s = 'tokenize these words'


words = re.compile(r'
\
b
\
w+
\
b|
\
$')


words.findall(s)


['tokenize', 'these', 'words']

Common Pitfalls



operations on fixed strings, single character
class, no case sensitive issues


re.sub() and string.replace()


re.sub() and string.translate()


match vs. search


greedy vs. non
-
greedy

PARSERS


Flat and Nested texts


Nested tags, Programming language
constructs


Better to do less than to do more!

Parsing Non flat texts


Grammar


States


Generate tokens and Act on them


Lexer
-

Generates a stream of tokens


Parser
-

Generate a parse tree out of the
tokens


Lex and Yacc

Grammar Vs RE



Floating Point


#
----

EBNF
-
style description of Python
---
#


floatnumber ::= pointfloat | exponentfloat


pointfloat ::= [intpart] fraction | intpart "."


exponentfloat ::= (intpart | pointfloat) exponent


intpart ::= digit+


fraction ::= "." digit+


exponent ::= ("e" | "E") ["+" | "
-
"] digit+


digit ::= "0"..."9"

Grammar Vs RE

pat = r'''(?x)


( # exponentfloat


( # intpart or pointfloat


( # pointfloat


(
\
d+)?[.]
\
d+ # optional intpart with fraction


|


\
d+[.] # intpart with period


) # end pointfloat


|


\
d+ # intpart


) # end intpart or pointfloat


[eE][+
-
]?
\
d+ # exponent


) # end exponentfloat


|


( # pointfloat


(
\
d+)?[.]
\
d+ # optional intpart with fraction


|


\
d+[.] # intpart with period


) # end pointfloat


'''

PLY
-

The Python Lex and Yacc


higher
-
level and cleaner grammar language


LALR(1) parsing


extensive input validation, error reporting,
and diagnostics


Two moduoles lex.py and yacc.py

Using PLY
-

Lex and Yacc


Lex:


Import the [lex] module


Define a list or tuple variable 'tokens', the
lexer is allowed to produce


Define tokens
-

by assigning to a specially
named variable ('t_tokenName')


Build the lexer


mylexer = lex.lex()


mylexer.input(mytext) # handled by yacc

Lex

t_NAME = r'[a
-
zA
-
Z_][a
-
zA
-
Z0
-
9_]*'


def t_NUMBER(t):


r'
\
d+'


try:


t.value = int(t.value)


except ValueError:


print "Integer value too large", t.value


t.value = 0


return t


t_ignore = "
\
t"


Yacc


Import the 'yacc' module


Get a token map from a lexer


Define a collection of grammar rules


Build the parser


yacc.yacc()


yacc.parse('x=3')

Yacc


Specially named functions having a 'p_' prefix




def p_statement_assign(p):




'statement : NAME "=" expression'




names[p[1]] = p[3]




def p_statement_expr(p):




'statement : expression'




print p[1]

Summary




String Functions



A thumb rule
-

if you can do, do it.


Regular Expressions



Complex patterns
-

something beyond simple!


Lex and Yacc



Parse non flat texts
-

that follow some rules


References


http://docs.python.org/


http://code.activestate.com/recipes/langs/python/


http://www.regular
-
expressions.info/


http://www.dabeaz.com/ply/ply.html


Mastering Regular Expressions by Jeffrey E F.
Friedl


Python Cookbook by Alex Martelli, Anna Martelli &
David Ascher


Text processing in Python by David Mertz

Thank You

Q & A