
Python@Work
2007-Jul-12
PkManager.py
Howard Kapustein
Director of Technology and Architecture
Manhattan Associates
Background

Director of Technology and Architecture

Manhattan Associates, 7 years
 EPCglobal Reader Protocol 1.0, Co-Chairman [RFID]
 SMS, Platform Services (Architecture/Subsystems), 12 years
 Open Source submitter (Jython, among others)

20 years experience

Acronym Soup: C++, Java, Visual Basic, Windows, Unix, concurrency,
i18n, security, RDBMS, GUI, web server, XML, TCP, web services,
REST, AJAX, JSON(!), oodles more
 No COBOL, No Perl 

Python since 1997

Switched from AWK
  Thompson AWK compiler created EXEs
  Getting 'long in the tooth'
Tried to learn TCL – not so much
Stumbled over Eric Raymond's “Why Python?” essay
  Made perfect sense (Google for “why python eric raymond”)
Python is Beautiful (even back when it was 1.5)



Rich language, Richer library
Thank god for py2exe 
Pywin32 is pretty handy too
Application

Warehouse Management Open Systems (WMOS)

Large C++, CORBA, portable 'enterprise' application
  >8 million lines of code
  Borland's VisiBroker for C++
  AIX, HP-UX, Linux, Solaris, Windows
24x7x365 – Near-realtime 'Execution' system
  Heavy RF+MHE interaction
  99% of activity is high volume, low latency
  i.e. 1 hour outage = millions of dollars
Heavy customization element
  Routinely modified for every customer
  Each customer = Forked codebase
  IOW more variables + post-release
Performance, Scalability, Latency, Reliability, Resiliency
  The 'not negotiable' family
Problem

CORBA process:
  Server = EXE: initializes, registers available factories with ORB, responds to requests
  Client:
    Factory* f = bind("factory"); Object* o = f->newInstance();
    o->DoStuff(); release o; release f; //aka delete
  ORB knows what factories are supposed-to-be and actually-available
  Borland: If requested factory not running, ORB asks the Object Activation Daemon (OAD) to start it [Just-In-Time Activation]

Problem: OAD stability is abominable
  Runs for hours/days, then randomly hangs or crashes for no apparent reason
  But JIT support made it popular for non-production (test, dev, …)
  Which doesn't mean we didn't regularly see support issues due to folks using the OAD

Homegrown replacements:
  PkPad: Unix shell script, pre-start list of processes, polling via ps to detect premature death and restart
    Cons: 30 second sleep between sweeps (or huge perf hit), no JIT, no management
  PkManager.exe: NT Service, multithreaded, interrupt-driven (no polling)
    Cons: Windows only (<20% customers), no JIT
Solution: PkManager.py

Superset: JIT + PreStart, interrupt-based (no polling), administration interface

And by-god-rock-solid-reliable!
Basic Architecture


Global Variable: timeToExit = threading.Event()

Thread 1: Main
  Initialize (parse command line etc)
  Start worker threads
  Main loop
    while not timeToExit.isSet():
        time.sleep(0.1)

Thread 2: Monitor (Process Manager)
    while not timeToDie.isSet():
        ProcessRequests(); StartChildren(); WaitForDeath()
    timeToExit.set()

Thread 3: API (Web Server)
  JIT requests
  Administration Console
  Web Services

Thread 4: Uptime (Reporter)
    while not timeToDie.isSet():
        print 'Uptime: %s since %s' % (now-startup, startup)
        timeToDie.wait(n)
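The Uptime thread's use of timeToDie.wait(n) instead of time.sleep(n) is worth a closer look: the wait returns as soon as the event is set, so shutdown does not have to sit out the reporting interval. A minimal sketch of that idea, written in Python 3 spelling (is_set() rather than isSet()); the names time_to_die, first_report and reports are illustrative and not from PkManager:

```python
import threading
import time

time_to_die = threading.Event()     # plays the role of timeToDie
first_report = threading.Event()    # illustration only: lets the demo sync up
reports = []

def uptime_reporter(interval):
    """Record an uptime sample every `interval` seconds until told to stop."""
    startup = time.time()
    while not time_to_die.is_set():
        reports.append(time.time() - startup)
        first_report.set()
        # wait() is an interruptible sleep: it returns as soon as the
        # event is set rather than sleeping out the whole interval.
        time_to_die.wait(interval)

worker = threading.Thread(target=uptime_reporter, args=(60.0,))
worker.start()
first_report.wait(5.0)      # make sure one sample has been taken
t0 = time.time()
time_to_die.set()           # "time to die"
worker.join(5.0)
shutdown_seconds = time.time() - t0
```

Even with a 60 second interval, the thread should exit almost immediately after set().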
Configuration (DSL)

Configuration file = Domain Specific Language (DSL)


[[wmosprod.dat]]
Python dictionaries are sweet!
Look ma, it's JSON 
symbols = {'N':'order', 'OnStart':'#prestart', 'JIT':'#ondemand', …}
config = []
errors = 0
lineno = 0
for line in open('wmosprod.dat'):
    lineno += 1
    try:
        # Each line is a Python dict literal; symbols supplies the DSL keywords
        entry = eval(line.strip(), {}, symbols)
        config.append(entry)
    except Exception:
        print 'Error line %d' % (lineno)
        errors += 1
if errors > 0:
    raise UserWarning('Uh-oh…')



Users see simple and obvious configuration
Code is maintainable and simple
  Mostly to 'nicely' handle and report errors
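To make the "configuration file as DSL" idea concrete, here is a small self-contained sketch in Python 3 spelling. The sample lines, the 'exe' and 'mode' keys, and the load_config helper are assumptions for illustration; the real wmosprod.dat format is not shown in the slides. Unlike the slide code, this variant also passes an empty __builtins__ so config lines cannot reach open() and friends, which is a convenience rather than a security boundary.

```python
# Hypothetical symbol table and config text, for illustration only.
symbols = {'N': 'order', 'OnStart': '#prestart', 'JIT': '#ondemand'}
sample = """\
{N: 1, 'exe': 'alpha', 'mode': OnStart}
{N: 2, 'exe': 'beta', 'mode': JIT}
{N: 3, 'exe': 'gamma', 'mode': Bogus}
"""

def load_config(text, symbols):
    """Return (entries, errors); errors is a list of (lineno, message)."""
    entries, errors = [], []
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line:
            continue
        try:
            # Bare names in the line (N, OnStart, ...) resolve via symbols.
            entry = eval(line, {'__builtins__': {}}, symbols)
        except Exception as exc:
            errors.append((lineno, '%s: %s' % (type(exc).__name__, exc)))
        else:
            entries.append(entry)
    return entries, errors

config, errors = load_config(sample, symbols)
```

The third sample line uses an unknown keyword on purpose, so it is reported with its line number instead of aborting the whole load.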
Signals – Ouch!

TIP: Do this very early
import signal
signals = dir(signal)
if 'SIGBREAK' in signals:
    signal.signal(signal.SIGBREAK, signal.default_int_handler)
if 'SIGTERM' in signals:
    signal.signal(signal.SIGTERM, signal.default_int_handler)

Surprises
  #1: SIGBREAK+SIGTERM not always available
  #2: Default action is usually terminate

Now except KeyboardInterrupt will trip
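The same tip can be wrapped in a small helper. This is a sketch in Python 3 spelling; install_interrupt_handlers is a made-up name, and it has to be called from the main thread because signal.signal() refuses to run anywhere else.

```python
import signal

def install_interrupt_handlers(names=('SIGBREAK', 'SIGTERM')):
    """Route each available signal to KeyboardInterrupt.

    Returns the names that were actually installed, since not every
    platform defines every signal (surprise #1 above).
    """
    installed = []
    for name in names:
        signum = getattr(signal, name, None)   # None where the platform lacks it
        if signum is not None:
            # default_int_handler raises KeyboardInterrupt when the signal fires
            signal.signal(signum, signal.default_int_handler)
            installed.append(name)
    return installed

installed = install_interrupt_handlers()
```

Returning the installed names gives the caller something to log at startup, which helps when a platform silently lacks one of the signals.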
Threading

All threads use same basic pattern e.g.
process_timeToExit = threading.Event() #Global
class Thread_Monitor(threading.Thread):
    def __init__(self, other, parms, …):
        threading.Thread.__init__(self)
        self.timeToDie = threading.Event()   # per-thread "please stop" flag
        …initialize…
    def run(self):
        try:
            …setup…
            while not self.timeToDie.isSet():
                …do stuff…
        except KeyboardInterrupt:
            print 'Ctrl-Break detected; terminating…'
        except Exception, e:
            print FormatException()
        process_timeToExit.set()
    def stop(self, timeout=None):
        self.timeToDie.set()
        threading.Thread.join(self, timeout)

threading.Event is your friend
  Global Event to coordinate process termination/cleanup
  Per-thread communication
    “Thread, Kill Thyself” = Event.set(); “Time to die?” = Event.isSet()
    “Thread, Art Thou Dead?” = Thread.join()

Alternative, pair of events:
timeToDie = threading.Event()
iAmDead = threading.Event()
def KillThyself(): timeToDie.set()
def TimeToDie(): return timeToDie.isSet()
def IAmDead(): iAmDead.set()
def AreYouDeadYet(): return iAmDead.isSet()
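Here is a runnable rendering of the pattern in Python 3 spelling. StoppableWorker, failing_work and the error attribute are illustrative names, and the worker body is a stand-in; the point is that a thread dying for any reason flips the process-wide event, so the main loop notices and shuts everything down.

```python
import threading
import traceback

process_time_to_exit = threading.Event()   # global: "the process should exit"

class StoppableWorker(threading.Thread):
    def __init__(self, work):
        super().__init__()
        self.work = work
        self.time_to_die = threading.Event()   # per-thread stop request
        self.iterations = 0
        self.error = None

    def run(self):
        try:
            while not self.time_to_die.is_set():
                self.work()
                self.iterations += 1
                self.time_to_die.wait(0.01)
        except Exception:
            self.error = traceback.format_exc()   # keep the report for the caller
        finally:
            process_time_to_exit.set()            # any exit ends the process

    def stop(self, timeout=None):
        """Ask the thread to die, wait for it, and report whether it did."""
        self.time_to_die.set()
        self.join(timeout)
        return not self.is_alive()

def failing_work():
    raise RuntimeError('simulated failure')

crashed = StoppableWorker(failing_work)
crashed.start()
exit_requested = process_time_to_exit.wait(5.0)
stopped = crashed.stop(5.0)
```

Setting the global event in a finally clause is a small departure from the slide, which sets it after the except blocks; the effect is the same for the cases shown.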
FormatException()

Simplify exception reporting
import sys, traceback

def __function__(nFramesUp=1):
    """Create a string naming the function n frames up on the stack."""
    co = sys._getframe(nFramesUp+1).f_code
    return "%s (%s @ %d)" % (co.co_name, co.co_filename, co.co_firstlineno)
def FormatException(ei=None):
    """Format an exc_info tuple (default: the exception being handled)."""
    if ei is None:
        ei = sys.exc_info()
    info = traceback.format_exception(ei[0], ei[1], ei[2])
    return ''.join(info)

Typical usage:
try:
    DoSomething()
except SomeException:
    print FormatException()

Never need to catch the exception object, though you can
try:
    DoSomething()
except SomeException, e:
    print FormatException(sys.exc_info())   # takes the exc_info tuple, not e
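For readers on current Python, the same helper works essentially unchanged. A small sketch in Python 3 spelling; format_exception and risky are illustrative names:

```python
import sys
import traceback

def format_exception(exc_info=None):
    """Return the full traceback text for exc_info (default: current exception)."""
    if exc_info is None:
        exc_info = sys.exc_info()
    return ''.join(traceback.format_exception(*exc_info))

def risky():
    return 1 / 0

try:
    risky()
except ZeroDivisionError:
    # No need to bind the exception object; exc_info() already has it.
    report = format_exception()
```

The returned string holds the usual "Traceback (most recent call last):" report, ready to print or log.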
KeyboardInterrupt

try block necessary per thread


Raised on the active thread when detected 
Worse, KeyboardInterrupt derives from StandardError (and so from Exception)

except Exception eats everything



Including KeyboardInterrupt and SystemExit!
Probably not what you wanted…
This coupled with SIGBREAK fun was a bear to figure out

Python 3000 is supposed to 'fix' this


Changing the exception hierarchy!
Should make porting…fun…
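For comparison, this is how the reworked hierarchy behaves in Python 3, where KeyboardInterrupt and SystemExit sit directly under BaseException rather than under Exception. The classify helper is purely illustrative:

```python
def classify(exc_type):
    """Raise exc_type and report which except clause catches it."""
    try:
        raise exc_type()
    except Exception:
        return 'caught by except Exception'
    except BaseException:
        return 'passed through to except BaseException'

results = {t.__name__: classify(t)
           for t in (ValueError, KeyboardInterrupt, SystemExit)}
```

Under that hierarchy a blanket except Exception in a worker loop no longer swallows Ctrl-Break or sys.exit().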
Web Server

PkManager predates WSGI's emergence
import BaseHTTPServer, SocketServer
from StringIO import StringIO

class PkManagerWebServer(SocketServer.ThreadingMixIn, BaseHTTPServer.HTTPServer):
    pass
class PkManagerRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    protocol_version = 'HTTP/1.0'
    server_version = 'PkManagerHTTP/' + __version__
    def do_HEAD(self):
        self.do_GET()
    def do_POST(self):
        self.ProcessRequest(self.rfile)
    def do_GET(self):
        requestbody = StringIO()    # GET has no body; hand over an empty one
        requestbody.seek(0)
        self.ProcessRequest(requestbody)
        requestbody.close()
    def ProcessRequest(self, requestbody):
        …parse url…
        # Dispatch by name: /process/start -> Handler__process_start
        name = 'Handler_' + path.replace('/', '_')
        handler = self.__class__.__dict__.get(name)
        if handler is None:
            if not self.ServeStaticFile():
                self.ProcessResponse(400)
            return
        else:
            result = handler(self)
        if 'Cache-Control' not in headers:
            headers['Cache-Control'] = 'private, max-age=0'
        self.ProcessResponse(statuscode, body, headers)
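The name-based dispatch in ProcessRequest can be tried without a socket. A minimal sketch in Python 3 spelling; Dispatcher and its two handlers are hypothetical, and only the 'Handler_' + path convention and the (status, body) return shape come from the slides:

```python
class Dispatcher:
    def handle(self, path):
        """Map a URL path to a method name and call it."""
        name = 'Handler_' + path.replace('/', '_')
        handler = getattr(self, name, None)
        if handler is None:
            return (400, 'No handler for %s' % path)
        return handler()

    # Hypothetical entry points: /process/list and /status
    def Handler__process_list(self):
        return (200, 'alpha,beta')

    def Handler__status(self):
        return (200, 'OK')

d = Dispatcher()
```

Adding an endpoint is then just a matter of adding a method; there is no routing table to keep in sync.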

Request Handlers

PkManagerRequestHandler methods, e.g.
monitor_requests = Queue.Queue() #Global variable
def Handler__process_start(self):                          #1
    parms = SplitToKVPairs(self.rfile)                     #2
    processname = parms.get('exe')
    if processname is None:
        return (400, 'Missing parameter (exe=<name>)')
    timeout = int(parms.get('wait', TimeoutDefault))
    iamdone = g_EventCache.get(timeout)
    request = (self.effective_path, iamdone, processname)  #3
    monitor_requests.put(request)                          #4
    realtimeout = self.TimeoutMSecToRealValue(timeout)     #5
    if iamdone is not None:
        iamdone.wait(realtimeout)                          #6
        if not iamdone.isSet():
            return (408, None)                             #7
        g_EventCache.put(iamdone)                          #8
    return (200, None)                                     #9

#1: Method name = 'Handler_' + URL's path component
#2: Parameters are fundamentally URL query parameters
#5: Timeout = N or Infinite or NoWait
#6: Wait up to the timeout
#7: If timeout, HTTP status = 408 Request Timeout
#9: Success! HTTP status = 200 OK
EventCache


New Event() per request = Huge Perf Pig
Took 3 hours to identify bottleneck

Only 20 minutes to solve! I ♥ Python
class EventCache:
    def __init__(self):
        self.cache = Queue.Queue()
    def get(self, timeout):
        if timeout == Timeout_NoWait: return None
        try:
            event = self.cache.get_nowait()
            event.clear()
            return event
        except Queue.Empty:
            return threading.Event()
    def put(self, event):
        self.cache.put(event)
    def __len__(self):
        return self.cache.qsize()
g_EventCache = EventCache()

Call get(timeout) for a new Event

Call put(event) to return Event to cache when done


Only if done with the Event
If errors occurred (e.g. timeout), don't put()

Python will clean up the Event object once no longer referenced
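A usage sketch of the cache in Python 3 spelling (the queue module, is_set()). The NoWait shortcut is left out here to keep the example focused on reuse; whether Event creation is still a bottleneck on a current interpreter is something to measure rather than assume.

```python
import queue
import threading

class EventCache:
    """Pool of threading.Event objects, handed out cleared."""
    def __init__(self):
        self._cache = queue.Queue()
    def get(self):
        try:
            event = self._cache.get_nowait()
        except queue.Empty:
            return threading.Event()    # pool empty: make a fresh one
        event.clear()                   # recycled events may still be set
        return event
    def put(self, event):
        self._cache.put(event)
    def __len__(self):
        return self._cache.qsize()

cache = EventCache()
first = cache.get()        # pool is empty, so this is a new Event
first.set()
cache.put(first)           # done with it: return it to the pool
cached_count = len(cache)
second = cache.get()       # same object comes back, cleared
```

Clearing on get() rather than on put() means an event that is returned in an odd state cannot leak a stale "set" to the next request.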
Queue.Queue


All inter-thread-communication via Event and Queue
Handler creates a tuple to queue
(resource, event, …parameters…)

Output parameters passed as empty list
iamdone = Event(); name = []; age = []; shoesize = []
request = (self.effective_path, iamdone, name, age, shoesize)
queue.put(request)
iamdone.wait()
print name[0], age[0], shoesize[0]

Monitor thread pulls requests from queue
def HandleRequests():
    try:
        while 1:
            request = queue.get_nowait()
            path = request[0]
            name = 'HandleAPIRequest_' + path.replace('/', '_')
            handler = globals().get(name); assert handler is not None
            handler(request)
    except Queue.Empty, e:
        pass
def HandleAPIRequest__some_service_entrypoint(request):
    name = request[2]; age = request[3]; shoesize = request[4]
    …do stuff…
    name.append(…); age.append(…); shoesize.append(…)
    iamdone = request[1]
    if iamdone is not None: iamdone.set()

So effective I ported Queue to C++
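The round trip (request tuple in, results out through an empty list, Event as the completion signal) fits in a few lines. A sketch in Python 3 spelling; the '/lookup' path, the handler body and the variable names are invented for illustration:

```python
import queue
import threading

requests = queue.Queue()

def HandleAPIRequest__lookup(request):
    """Hypothetical service entry point: fills the output list, then signals."""
    _path, done, key, result = request
    result.append(key.upper())          # stand-in for real work
    if done is not None:
        done.set()

def handle_requests():
    """Drain the queue, dispatching each request by its path."""
    while True:
        try:
            request = requests.get_nowait()
        except queue.Empty:
            return
        name = 'HandleAPIRequest_' + request[0].replace('/', '_')
        handler = globals().get(name)
        assert handler is not None
        handler(request)

# Caller side: the empty list is the output parameter.
done, result = threading.Event(), []
requests.put(('/lookup', done, 'wmos', result))
monitor = threading.Thread(target=handle_requests)
monitor.start()
completed = done.wait(5.0)
monitor.join(5.0)
```

Because the list is shared by reference, the caller reads result[0] after the wait without any further locking; the Event provides the ordering.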
Internationalization (i18n)

Initially tried module gettext
  Standard. Capable. Simple API. Very similar to GNU gettext API
But…needed simple deployment
  “Zero-Install” – anything else is just a support call (or many…)
  How to find the message catalogs?
    localedir/language/LC_MESSAGES/domain.mo
    Create a 3-level tree, with very fixed names, to drop a bunch of localized text resources?
  And what about customization?
Bah. Python to the rescue!

[[PkManagerI18N-*.py]]
import fnmatch, os, sys

i18n = {}; i18nMeta = {}
def i18nLoad(path):
    sys.path.insert(0, path)
    for root, paths, filenames in os.walk(path):
        for filename in filenames:
            if fnmatch.fnmatch(filename, 'PkManagerI18N-*.py'):
                name = os.path.splitext(filename)[0]
                try:
                    module = __import__(name)
                    text = getattr(module, 'Text', None)
                    if text is not None:
                        # Meta is optional; tolerate modules without it
                        meta = getattr(module, 'Meta', None) or {}
                        for locale in text.iterkeys():
                            i18n[locale] = text[locale]
                            i18nMeta[locale] = meta.get(locale)
                except (ImportError, SyntaxError), e:
                    Abort(5, 'Error loading i18n resource %s' % (filename))
    del sys.path[0]
Internationalization (i18n) – Part Deux

Simple format
Text = { 'es': { 'About':'Sobre', 'English' : u'Ingl\u00e9s', … } }
Meta = { 'es': { 'Name':'Spanish', 'Display' :u'Espa\u00f1ol' } }

But what about complex languages? Python source files can use arbitrary encodings!
# -*- coding: utf8 -*-
Text = { 'zh': { 'About':u'亸乾些亖亃', … },
         'jp': { 'About':u'ノキアについて', … },
         'ar': { 'About':u'عن' } }
Meta = { 'zh': { 'Name':'Chinese', 'Display':u'中国 ' },
         'jp': { 'Name':'Japanese', 'Display':u'日本語 ' },
         'ar': { 'Name':'Arabic', 'Display':u'العربية' } }

One neat trick in module gettext: _() is defined as 'lookup-text'. Nifty idea
print _('About')
def _(s, locale=None, language=None):
    if locale is None: locale = options.locale
    textlist = i18n.get(locale)
    if textlist is not None:
        text = textlist.get(s)
        if text is not None:
            return text
    if language is not None:
        textlist = i18n.get(language)
        if textlist is not None:
            text = textlist.get(s)
            if text is not None:
                return text
    if isint(s): return s
    else: return '[%s]' % (s)
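The lookup order in _() (exact locale first, then the broader language, then a visibly bracketed key) can be shown with a toy catalog. A sketch in Python 3 spelling; the 'es_MX' entry and the translate name are made up for the example, and the isint() passthrough from the slide is omitted:

```python
# Toy catalog: 'es_MX' overrides one string, everything else falls back to 'es'.
i18n = {
    'es': {'About': 'Sobre', 'English': 'Ingl\u00e9s'},
    'es_MX': {'About': 'Acerca de'},
}

def translate(key, locale, language=None):
    """Look up key under locale, then language; bracket it if untranslated."""
    for name in (locale, language):
        if name is None:
            continue
        text = i18n.get(name, {}).get(key)
        if text is not None:
            return text
    return '[%s]' % key     # brackets make missing translations easy to spot

about = translate('About', 'es_MX', 'es')
english = translate('English', 'es_MX', 'es')
missing = translate('Checkout', 'es_MX', 'es')
```

Returning the bracketed key instead of raising keeps the UI usable while making untranslated strings obvious during testing.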
py2exe

Running PkManager.py is natural on Unix


Not so much on Windows
py2exe binds source + runtime into .exe
# setup.py
import os
from distutils.core import setup
import py2exe
setup(name='PkManager',
      version=GetVersion(),
      description="WMOS process manager, overseer, care and feederer",
      author='Manhattan Associates',
      url='http://www.manh.com',
      console=[{'script':"PkManager.py",
                'icon_resources':[(1, 'PkManager.ico')]}],
      zipfile=None, #Append to .exe / no separate .zip
      data_files=[('.', [os.path.abspath(r'wmosprod.dat')])],
      options={"py2exe":{"compressed":1,
                         "optimize":2,
                         "xref":0,
                         "includes":[],
                         "dll_excludes":[]}})

Create the executable
python -OO setup.py py2exe

Replace console parameter to compile an NT Service
service=[{'modules':'PkManager',
'script':"PkManager.py",
'icon_resources':[(1, 'PkManager.ico')]}],
Demo
Questions?
Blog: http://blog.kapustein.com
 Email: [email protected]