Python in an evolving enterprise system: Integration solutions with Hadoop*
In 2011, we moved our data pipeline to a Hadoop stack in order to enable horizontal scalability for future growth. However, our optimization tools used for data exploration, aggregations, and general data hackery are built in Python. Over the past few months, we evaluated multiple solutions for integrating Python with Hadoop. In our talk, we'll explore the different Python-Hadoop integration options, share our evaluation process and best practices, and invite an interactive dialogue of lessons learned.
Our data pipeline is growing like crazy, processing more than 30 terabytes of data every day and more than tripling in the last year alone. In 2011, we moved our data pipeline to a Hadoop stack in order to enable horizontal scalability for future growth. Our optimization tools used for data exploration, aggregations, and general data hackery are critical for updating budgets and optimization data. However, these tools are built in Python, and integrating them with our Hadoop data pipeline has been an enormous challenge. Our continued explosive growth demands increased efficiency, whether that’s in simplifying our infrastructure or building more shared services. Over the past few months, we evaluated multiple solutions for integrating Python with Hadoop, including Hadoop Streaming, Pig with Jython UDFs, writing MapReduce in Jython, and of course, why not just do it in Java? In our talk, we’ll explore the different Python-Hadoop integration options, share our evaluation process and best practices, and invite an interactive dialogue of lessons learned.
We will be giving this talk for the first time at PyData SV 2013 in March.
Director of Optimization and Analytics
As Director of Optimization and Analytics, Dave Himrod manages a team of analysts, quants, and engineers devoted to crafting world-class algorithms. When Dave joined in 2009, he managed AppNexus’ first account – eBay. While building AppNexus’ original optimization algorithm, Dave was heavily involved in building out the data pipeline and defining the data model still in use today. He has since grown his team to more than 30 people and focuses his time on building a world-class scalable optimization system. He and his team continue to improve the tools for optimized pricing and budgeting for the more than 40 billion ad impressions our platform sees per day. Dave has a Bachelor’s Degree in Computer Science from the University of Pennsylvania.
As Engineering Manager for Optimization and Analytics, Steve manages software development for AppNexus’ best-in-class systems for ad transaction optimization. Since joining AppNexus in 2010, Steve has led the design of distributed systems for scalable computation and data processing and set the technical standards for a team of engineers while iterating on the optimization feature set. Previously, Steve was a software developer at Google working on Google Places for Business and Local Search Quality. Steve has a Master of Engineering in Electrical Engineering and Computer Science and a Bachelor’s Degree in Computer Science from MIT.
Software Engineer, Optimization and Analytics
As a Software Engineer for Optimization and Analytics, Angelica builds and implements algorithms for AppNexus’ ad transaction optimization systems. Since joining AppNexus in 2011, Angelica has implemented and continues to improve the original budgeting and spend pacing algorithms. Previously, Angelica was a software developer working in Research at Deutsche Bank, developing proprietary stock market index investment instruments and providing quantitative analysts with customized equity research tools. Angelica has a Bachelor’s Degree in Electrical and Computer Engineering from Cornell University.