Zipfian Academy is joining Galvanize

After one year, 70 students, and 93% of our graduates placed into data science roles at top tech companies, we’re joining forces with Galvanize, an innovation ecosystem offering education, venture capital, and co-working across the United States. By combining our expertise and resources, we’re able to increase our capacity to provide industry-focused technical education. We couldn’t be more excited.

Zipfian Academy is now part of the Galvanize gSchool, offering industry-focused training in data science and web development.

The value of industry-focused education is increasingly evident, as top technology companies including Facebook, Twitter, Airbnb, Tesla, Uber, Square, and others have hired talent from our program. It’s not only a modern toolkit that matters, but a comprehensive understanding of approaches to common problems faced by industry practitioners.

The problem for high-potential talent is that this kind of education often happens on the job, and it’s difficult for fresh graduates to find a fit. For many companies, experienced talent is increasingly expensive and hard to find. By providing hands-on, real-world experience to talent, we fill in critical skills gaps and graduate candidates that are ready for the job market. We pair industry-focused training with structured hiring support, including individual mentorship by Data Scientists in industry, and a Hiring Day where graduates meet with 20+ employers.

Community building has been a cornerstone of Zipfian Academy as we’ve grown our network of alumni, hiring partners, and mentors. We retain alumni and hiring partners as frequent guest speakers and project mentors. We founded the SF Data Science Meetup and grew its membership to nearly 2,600 individuals. Galvanize shares the same philosophy on the importance of this kind of community and network. Alongside Galvanize, we are building a vibrant ecosystem of industry-focused education and events, co-working, and venture funding for startups all under the same roof.

Zipfian’s Data Science Immersive will be offered alongside GalvanizeU’s accredited Master of Engineering in Big Data.

What’s more, this merger represents an exciting milestone in industry-focused technology education. Zipfian Academy will offer its data science 12-week immersive program alongside GalvanizeU’s accredited Master of Engineering in Big Data, which has opened applications for its inaugural class. Leaders of the Galvanize education program will contribute expertise to both the immersive program and Master’s, benefiting students with a focus on industry-ready skills and world-class instruction.

We’re so excited to join Galvanize to equip many more data scientists and engineers with industry-ready skills. Applications are open for the Winter 2015 immersive program at Zipfian Academy, and for the inaugural GalvanizeU Master’s in Big Data. Not ready for a full-time commitment? Check out Galvanize's data science workshops.

 

We Launched “Data Visualization and D3.js” on Udacity!

As a data scientist, your best insights are only as good as your ability to communicate them. Data visualizations that move and adjust to user input are an incredibly powerful way for your audience to intuitively understand the data.

We’ve come to deeply appreciate the importance of great visualization, and we were thrilled when Cheng-Han Lee and the Udacity team approached us about building a data visualization course with them on their platform. Building the course allows us to take concepts from our Data Visualization Workshops and make them accessible to learners all over the world. This is the first time we’ve offered Zipfian Academy education in this format, and we’re very excited to announce that enrollment is open!

“Data Visualization and D3.js” is an online course teaching the fundamentals of creating interactive graphics and communicating with data. By exploring the design principles behind NYTimes graphics, using open source technologies, and refining their communication skills, participants in the course will learn to tell their own stories with data.

The course teaches participants how to:

  • Communicate clearly with the best visual representation of your data
  • Tell stories, spark discussion, and create calls to action for readers
  • Design graphics like ones from the NYTimes and other media companies
  • Use open-source web technologies to create an online portfolio of your work
  • Utilize visualization libraries (dimple.js and D3.js) to create graphics

Our course is one step in Udacity’s Data Analysis Nanodegree, a 5-class series on the Udacity platform that together builds the skills needed for a particular area. Developed in partnership with Facebook, MongoDB, and Zipfian Academy, the Nanodegree allows participants to understand concepts and industry best practices while building a project portfolio to showcase to prospective employers. Registrants in the Nanodegree can access a coach who will evaluate projects built in the course.

With our class and the others in the track, the series enables participants to: 

  • Wrangle, extract, transform, and load data from various databases, formats, and data sources
  • Use exploratory data analysis techniques to identify meaningful relationships, patterns, or trends from complex data sets
  • Classify unlabeled data or predict into the future with applied statistics and machine learning algorithms
  • Communicate data analysis and findings well through effective data visualizations 

We’re really excited to launch our first open courseware with Udacity. Enrollment for the Nanodegree is open for a short time, from November 10th to 16th. Materials in the Data Visualization and D3.js course are available any time.

 

Want to learn the material from the Data Analyst NanoDegree in a full-time, immersive program in San Francisco? Apply by November 21 to the Winter 2015 cohort of our Data Science Immersive, starting January 5. 

 

Alumni Spotlight: Alex Mentch, Data Scientist at Facebook

Zipfian Academy has graduated more than 50 alumni, placing graduates into data science roles at Facebook, Twitter, Airbnb, Tesla, Uber, Square, Coursera, and many more Silicon Valley companies. Participants in our program come from backgrounds in engineering, data analysis, statistics, and occasionally professional poker. Here, we share an interview with Alex Mentch, a graduate from our Winter 2014 Cohort. 

Alex hails originally from Idaho, and studied electrical engineering at Washington University in St. Louis. Looking for a career transition into data science, Alex attended our Winter 2014 cohort where he built a search engine for state legislation. Alex interviewed at Facebook, Uber, Tesla, and Airbnb, and joined Facebook as a Data Scientist on their Product Analytics team.

Tell me about your background. What kind of work did you do at MIT and NASA?

My background is in electrical engineering, focused on controls and robotics. I have a BS and an MS from Washington University in St. Louis. My concentration focused on linear algebra, statistics, and stochastic processes - essentially applied math. Controls engineering is about making systems that regulate themselves and respond in ways you want them to, like autopilot or cruise control. I interned at NASA, the University of Idaho, and MIT Lincoln Laboratory, and I really liked it. I also worked full-time on missile defense research at Lincoln Lab before entering a PhD program in electrical engineering at the University of Maryland.

How did you get interested in data science?

I liked what I was doing, but the career started to seem too niche. I wanted to work in broader fields that had applications outside of the narrow industry that I’d found myself in. Right around the time I dropped out of the PhD program, I went to a DataKind weekend hackathon where I worked on a project trying to find a correlation between nighttime light intensity maps of Bangladesh and local estimates of poverty. I realized that a lot of what I had enjoyed about my work was actually data science. 


What did you do after deciding to pursue a data science career?

Data science is based on a lot of math that I already knew, but I needed to learn new approaches and tools. I spent the first summer after I dropped out of my PhD program doing Coursera courses in data science and programming. However, I wasn’t making progress as quickly as I wanted to, and didn’t feel like MOOCs would make me qualified for the field. One of my friends completed the Hackbright program so I knew programming bootcamps existed. I typed “data science bootcamp” into Google, and that’s how I found Zipfian Academy.

Why choose Zipfian Academy?

Sure, I looked at the programs at Berkeley, NYU, and Columbia. With one year of PhD experience, I knew I already had most of the skills, and all the math, that I needed. What I did need was to learn the right methods and tools, so I didn’t think I needed another year or two of school. A lot of the data science master’s programs seemed to be designed around the idea that you need X number of classes to get a master’s, which seemed inefficient. I ruled out a master’s degree right away.

The difference between Zipfian Academy and MOOCs is that learning alongside other humans is really helpful, especially learning from other people who are experienced in the field. I tried to do Coursera courses like they were real courses in a college, but it didn’t work. I was also interested in the connections to industry at Zipfian Academy, which I thought would get me on the right track.

What was it like being in the program at Zipfian Academy?

It was intense. We were there from 8 or 9 in the morning until well past dinner on most nights. I appreciated that the program was very focused and hands-on. The lectures were designed to get us started working on our own - they weren’t any longer than they needed to be. Because the program is focused on hands-on work, you make a lot of mistakes in the beginning, but you figure out how to solve them. It’s a really effective way of learning the material. You also develop the intuition that you need in this job - meaning a familiarity with the algorithms and tools that are available to you, and what kind of questions you can ask of the data. That kind of experience helps you do your job faster.

In a typical college or grad school class, you only apply the thing you learned in lecture to the problem given afterwards. But with a guided homework assignment like that, you don’t learn as much about how to discriminate between a set of possible approaches. Zipfian Academy provided that sort of learning that doesn’t usually happen until you’re working in a job.

The program was also very collaborative. We did 3 weeks of pair programming in the course. Even after that, we were still asking each other questions all the time and comparing approaches to solve problems. This helped us learn quite a bit from people who had different backgrounds, and therefore saw the same problems differently. The capstone projects we built were entirely independent, but I still ran things by people in the program with whom I’d worked most closely.

Tell me more about your capstone project. Why build a search engine for state legislation?

Originally, I was thinking about a project related to ALEC, an organization that provides model legislation to state legislators. The organization has members in most states but doesn’t publish its member list or the bills it writes. As a result of ALEC’s activities, state legislatures will have similar bills that are brought forward for discussion at the same time, but the true motivation for the bill isn’t obvious to the citizens of the states it’s being discussed in. I wanted to see if I could tease out networks of state legislators that often sponsor these similar bills.

What I found was that getting state legislation is very difficult. I ended up deciding to build the tool I needed to eventually do the analysis I wanted to do. I was able to get quite a bit of metadata from the Sunlight Foundation, but mostly I built the project by scraping bills from state websites and applying natural language processing techniques to make them searchable.


What was the data science job search like?

My interview process wasn’t too bad. I mostly used connections from Zipfian Academy’s Hiring Day, as well as AngelList and LinkedIn. As often as possible, I tried to find a friend of a friend who worked at a company to get in, and sometimes I was contacted through LinkedIn.

I was talking to both startups and more established companies at first. At a startup, I thought I’d learn more about starting a business than about data science. If I were the only Data Scientist, I was worried I’d be reinventing the wheel over and over. That was why I decided to focus on medium and larger-sized companies.

The typical application process was some sort of initial phone screen, a take-home assignment, then an on-site interview with algorithms, SQL questions, and product questions. I’m sure there are other ways they’re evaluating you, but you’re never really sure. The on-sites were pretty intense, with about three hours of back-to-back interviews.

I didn’t have to prepare for interviews too much. I was planning to answer algorithm questions in Python, so I made sure I was comfortable whiteboarding Python. I also made sure I was ready to talk basic stats, could whiteboard SQL, and that I was at least a little familiar with the company.

I interviewed at Facebook, Airbnb, Tesla, and Uber, and accepted an offer from Facebook on their Product Analytics team.

What advice would you give someone about breaking into data science?

From what I’ve seen, it’s hard to hire a data scientist, but it’s also hard to get hired. I’ve talked to a lot of people who are interested in data science but aren’t really sure if they want to move forward. I keep recommending looking into Zipfian Academy. It’s important to learn the things you need to get into the field as well as gain the connections necessary to get started. Companies often list a PhD as a requirement for data science roles, but the work doesn’t actually require it.

Every company has a different definition of data science and different projects for data scientists to do. Having a certain level of statistics and programming in your background is all that is necessary. From there, it depends on how you want to shape your career direction. For me, math definitely helps and makes it easier. 

If I were to do it all over again, would I make the same choices? The answer is yes. This was definitely the right path for me.

 

Zipfian Academy Launches Partnership with Skymind to Teach Deep Learning

We're providing world-class training, services, and support to speed adoption of deep learning and Deeplearning4j.

Zipfian Academy, the leading provider of immersive training programs focused on practical data science and data engineering skills, is thrilled to announce a new partnership with Skymind to offer a family of training and services based on the open-source deep learning library Deeplearning4j. In collaboration with Skymind founder and Zipfian Academy adjunct instructor Adam Gibson, Zipfian Academy is offering training programs that demonstrate how to apply deep learning to complex problems like machine vision, time series analysis, natural language processing, and speech recognition.

About Deep Learning and Deeplearning4j

Before Deeplearning4j, the technologies that power face detection at Facebook, and translation and image search at Google, were accessible only to a select few at large technology companies. The creation of an open-source deep learning library makes the power of deep learning available to practitioners worldwide.

At the cutting-edge of machine perception, deep learning offers improved accuracy over many common machine-learning methods. Deep neural networks are much better equipped to model the complex, non-linear relationships typically encountered with the large, messy datasets found in industry. In practice, this translates to better results and higher accuracy in a variety of domains.

Deeplearning4j is the first open-source, distributed deep learning library written in Java, and one of the only projects with a product roadmap designed for industry applications. With Deeplearning4j, it is possible to run neural networks natively on top of Hadoop/YARN without proprietary software. The library combines the scalability and speed of Java with the ease of use of Python’s scikit-learn, making it possible to implement deep learning at scale quickly and simply.

Deep Learning Workshops

Before today, there were no training centers devoted to applying deep learning to commercial problems. Zipfian Academy has invested heavily in innovative teaching methodologies, making it the ideal partner for Skymind to deliver the hands-on training needed to bring deep learning to industry at large.

Deep learning workshops begin in September 2014; engineers will use Deeplearning4j to detect unique faces among thousands of images, extract sentiment from tweets, and turn PDFs into text using OCR. Skymind founder Adam Gibson will be the lead instructor for the workshop series. Zipfian Academy will also host events through the SF Data Science Meetup to provide a high-level overview of the technologies and demonstrate their capabilities.

More information about deep learning workshops is available here: http://www.zipfianacademy.com/workshops/practical-deep-learning-its-applications

Thanks,

Team Zipfian Academy

Zipfian Academy Launches Data Fellowship and Data Engineering Immersive Program

Zipfian Academy’s immersive training programs help quantitative PhDs, software engineers, and analysts transition into data science careers. One year ago, we launched our Data Science Immersive program to provide practical data science education focused on solving real industry problems.

Today, we are happy to announce the launch of two new programs: the Data Fellowship and the Data Engineering Immersive.

From the start, it’s been clear that a curriculum built on input from hiring partners is the key to preparing our students for industry. We trained over forty exceptional individuals in the last year, selecting from 600 applicants with backgrounds in physics, econometrics, computer science, and mathematics.

We’re now the center of a community of over 5,000 followers and friends, including alumni, mentors, and partner companies. Our hiring network includes 30+ world-class engineering companies, such as Facebook, Eventbrite, Heroku, Khan Academy, Opower, and Silicon Valley Data Science.

We’ve seen first-hand the evolution of data science roles and specializations, such as Data Developer, Data Engineer, and even Data Janitor. These two new programs are designed to keep pace with that changing landscape.

The Data Fellowship

The Data Fellowship is designed for quantitative PhDs and data researchers with the experience in machine learning, statistics, and software engineering needed for data science roles. Fellows have the raw talent that matters, but are missing the practical experience employers are looking for. In this program, participants will synthesize their experiences into industry-ready skills, and receive structured support in navigating the data science career landscape. The Data Fellowship has no cost to students provided they accept a position through our hiring program.

Applications are open now for the Summer 2014 program beginning June 30th. Fellows will work with us for 6 weeks, filling in identified knowledge gaps and finishing a project portfolio to demonstrate their skills to companies excited to hire them.

Data Engineering Immersive

The Data Engineering Immersive program is for software engineers who will go on to build data infrastructures that power the services we use every day. We canvassed our hiring partners to determine the right mix of theory and practice for the curriculum, combining established frameworks like Hadoop with newer entrants such as Spark and Storm.

One insight we uncovered from our hiring partners is a disconnect in industry between data science and data engineering. Too often, each knows only a little about the other, and managers are frustrated by data scientists building models that don’t scale in production.

For this reason, the Data Science and Data Engineering immersives will run concurrently: students from both programs will work hand-in-hand on shared curriculum and projects. By pairing together to implement solutions, our data scientists and engineers will be uniquely qualified to collaborate in industry roles.

Applications are now open for the Data Engineering Immersive program, which begins in January 2015. Students will need significant software engineering experience to work through the curriculum, and will build industry-relevant skills in distributed systems.

Applications Are Open!

We can’t wait to get started, and in fact, we’ve already begun.

Apply Now: http://zipfianacademy.com/apply

Thanks!

Team Zipfian Academy

The Data Science Mindset

Names like ‘R’, ‘SQL’, and ‘D3’ make data science seem more like alphabet soup than a deliberate practice of working with data. It’s so easy to get lost in the sea of acronyms, packages, and frameworks that we often find our students prematurely optimizing for the right toolset to use, unable to move forward until they have researched every available option. In reality, data science isn’t just about the tools. It’s a mindset: a way of looking at the world. It’s about taking advantage of our modern computers and all of the information that they’re already collecting to study how things work and push the limits of human knowledge just a little bit further. We have a favorite saying around here — data is everything and everything is data. If we begin with this mindset, a lot of data science approaches naturally follow.

Store Everything

Storage is cheap. Collect everything and ask questions later. Store it in the rawest form that is convenient, and don’t worry about how or even when you’re going to analyze it. That part comes later.

Use Existing Data

We’re already storing data — let’s use it. When faced with questions, data scientists regularly adapt the query so that it can be approximately answered with an existing and convenient dataset. The best part of data science is discovering surprising applications of existing stores of data. For example, there is a plethora of satellite imagery of Earth. We can use this data to learn about fertilizer use in Uganda, or use pictures of the Earth at night to estimate rural electrification in developing countries.

Connect Datasets

We’re storing everything, all over the world, inexpensively, for the first time in history. There are many lessons to be learned by utilizing more of this treasure trove. Conventional statistics teaches a lot about choosing analysis methods appropriate for your data collection approach and tuning models for a specific dataset. But don’t worry about making the best use of a single source of data: focus on connecting disparate datasets rather than tuning your models.

Effective data science is about using a range of datasets, connecting the dots between one set of data and another, such as predicting restaurant health scores based on Yelp reviews. In machine learning speak: it’s often better to collect more features rather than spend days optimizing hyperparameters.
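As a toy illustration of that point, joining two sources on a shared key yields features that neither holds alone (the restaurant names and numbers below are invented):

```python
# Invented toy data standing in for two separate sources.
reviews = {"Taqueria X": 4.5, "Cafe Y": 3.0}      # e.g. review ratings
inspections = {"Taqueria X": 96, "Cafe Y": 71}    # e.g. health scores

# Join on restaurant name: each record now carries a feature from each source.
combined = [
    {"name": name, "rating": rating, "score": inspections[name]}
    for name, rating in reviews.items()
    if name in inspections
]

print(combined)
```

A model predicting health scores from review ratings only becomes possible once the two datasets are stitched together like this.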

Anything Can Be Quantified

Our culture loves to quantify. If you can turn it into a number, that number can be put into a table. Importantly, that table can now be processed by a computer.

A spreadsheet about sewer overflows is clearly data to most people, but what about a calendar? At first, a calendar might not seem like the sort of data that you analyze with statistics. However, you can also represent a calendar as a spreadsheet and as a graph.

Data science becomes a creative endeavor when you peel away the obvious variables presented to you. Maybe you have a bunch of PDF documents. You could easily extract the text in the PDFs and search through the content. Depending on the problem you are solving, these files hold more interesting information than just the text: you can get the page count, the file size, the shapes of the pages, and the program that created each file. There is information hidden in many datasets that goes beyond what’s immediately obvious.
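As a minimal sketch of that idea (the embedded one-page document below is a toy, and in practice you would reach for a proper PDF library such as pypdf rather than regular expressions):

```python
import re

# Stdlib-only sketch of mining a PDF for metadata beyond its text.
# Real PDFs are far messier than this hand-written toy document.
pdf_bytes = b"""%PDF-1.4
1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj
3 0 obj << /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >> endobj
%%EOF"""

file_size = len(pdf_bytes)                             # size on disk
pdf_version = pdf_bytes[5:8].decode()                  # from the %PDF- header
page_count = len(re.findall(rb"/Type\s*/Page\b", pdf_bytes))
page_shapes = re.findall(rb"/MediaBox\s*\[([^\]]*)\]", pdf_bytes)

print(pdf_version, page_count, page_shapes)
```

Even this crude pass recovers the version, page count, and page shape without touching the text itself.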

There is a lot of talk about the difference between different kinds of data. There’s “qualitative” vs. “quantitative” and “unstructured” vs. “structured.” To me, there isn’t much difference between “qualitative” and “quantitative” data, nor is there between “unstructured” and “structured” data because I know that I can convert between the different types.

At first, the registration papers of a company might not seem like interesting data. They begin as paper, most of the fields are text, and the formats aren’t particularly standardized. But when you put them in a database in a machine-readable format, qualitative data becomes quantitative data that can supplement other data sources.

Send Boring Work to Robots

We no longer live in an era where “computer” refers to someone who carries out calculations. Find yourself doing something over and over? Give it to the bots. As far as data analysis goes, modern computers can be far more effective at rote tasks, such as drawing new graphs with every update of a dataset.

Data collection is a prime example of a task that should be automated. A common scene in university research labs is swaths of grad students handing out paper questionnaires to participants of studies. The data scientist says: collect the data automatically and unobtrusively, using existing systems whenever possible. The supercomputers we carry in our pocket are a great place to start.

This mindset can be applied not only to the data, but also to the process itself. Rather than learning and remembering your entire analysis process, you can write a program that does the whole thing for you, from the original acquisition of the data, to the modeling, to the presentation of results to another person. By making everything a program, you make it easier to find mistakes, to update your analyses, and reproduce your results.
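A skeletal sketch of what that looks like, with invented stand-in data and thresholds:

```python
# Each stage of the analysis is a function, so the whole run is reproducible.
def acquire():
    # Stand-in for scraping or querying a live data source.
    return [88, 92, 100, 71, 98, 64, 90]

def model(scores):
    mean = sum(scores) / len(scores)
    return {"mean": mean, "below_70": [s for s in scores if s < 70]}

def report(summary):
    return (f"mean score {summary['mean']:.1f}; "
            f"{len(summary['below_70'])} restaurant(s) under 70")

def run_pipeline():
    # Acquisition -> modeling -> presentation, end to end, in one call.
    return report(model(acquire()))

print(run_pipeline())
```

Rerunning the analysis after the source data changes is now a single function call, and any mistake lives in exactly one testable place.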

Tools

Once inside the data science mindset, solving interesting problems becomes a function of data acquisition and processing. Computers can fit models to datasets too big to wrap your head around, make predictions from them, and convert paper documents into electronic tables. They probably know more about you and your habits than you know yourself! Use the tools available to you, but don’t get caught up on the tools themselves.

Properly discussing these relevant tools is another post (maybe a book), but here’s one thought. While it always helps to have more education, you don’t need a PhD in math or computer science in order to create useful things. Loads of wonderful algorithms have already been implemented for you, and simple algorithms often work quite well. If you’re just getting started, focus on the “plumbing” that connects different datasets and systems together.

Data Science Mindset at Zipfian Academy

Our course teaches many data science tools, but we also teach the data science mindset, because you need both to be a great data scientist. To this end, we organize our 12-week course by projects — such as a recommendation engine or spam filter — rather than software packages or algorithms. We teach the various tools in context of applied projects so students learn how to choose the appropriate tool and how to build the plumbing that connects them.

In the end, it’s not about the newest, trendiest framework or fastest data analysis platform. It’s about finding interesting insights in your data and sharing them with the world. Start small, get your hands dirty, and have fun!

How to Data (Science): Mapping SF Restaurant Inspection Scores

Are you a company or data scientist that would like to get involved? Give us a shout at hello@zipfianacademy.com.

If this post excites you, I encourage you to apply to our 12 week immersive bootcamp (applications close August 5th) where you will learn data science through hands-on exercises and real world projects!

In our previous post, we outlined the best data science resources we have found online. In this post, we’ll walk through our data science process by analyzing the inspections of San Francisco restaurants, using publicly available data from the Department of Public Health. We will explore this data to map the cleanliness of the city and gain perspective on the relative meaning of these scores through statistics. Along the way, we’ll use a spectrum of powerful tools for data science (from the UNIX shell to pandas and matplotlib) and share some tips and tricks.

While the health inspection scores are based on a fixed scale (i.e. a threshold for health quality) where each restaurant can be considered an independent random variable, we think there is value in looking at how the scores are distributed. This does not actually assess the chance of foodborne illness or the quality of food; it simply looks at the scores from an exploratory perspective.

Takeaways

  • Understand the data science process
  • Learn about essential tools (UNIX, Python, and associated libraries)
  • Be inspired by Open Data and our role as data citizens

All of the code is contained in an IPython notebook and can be viewed or downloaded from Github.

tl;dr

When we analyzed the data, we found that the most common score was a perfect 100. Interestingly, the distribution is heavily skewed towards high scores (mean of 92, 75th percentile of 98, 25th percentile of 88), and there exists a long tail of restaurants with very low scores.

Plotting the data geographically, we find a large concentration of restaurants with scores below 70 in Chinatown and Civic Center, putting them in the bottom tenth of all scores. The contrast in scores between the Financial District and Chinatown is quite interesting: the highest-scoring cluster (FiDi) neighbors the lowest-scoring cluster (Chinatown). Also of particular interest was the gradient along 24th St., moving from Noe Valley (high scores) towards the Mission (lower scores). We plan on adding more data to correlate common health violations with scores in those areas. Have any ideas for a health data mashup? Send them our way at hello@zipfianacademy.com.

The interactive map below allows you to visualize the data by scores and density. Check it out and see for yourself:

Each restaurant is geographically binned using the D3.js hexbin plugin. The color gradient of each hexagon reflects the median inspection score of the bin, and the radius of the hexagon is proportional to the number of restaurants that fall in the bin. Binning is first computed with a uniform hexagon radius over the map, and then the radius of each individual hexagon is adjusted for how many restaurants ended up in its bin.

Large blue hexagons represent many high-scoring restaurants in an area, and small red hexagons represent a few very poorly scoring restaurants. The controls on the map allow users to adjust the radius (Bin:) of the hexagons used to compute the binning, as well as the range (Score:) of scores to show on the map. The color of the Bin: slider represents the average color of the two Score: range sliders, and its size represents the radius of the hexagons used to compute the binning. The colors of each of the Score: sliders represent the threshold color for that score; i.e. if the range is 40 - 100, the left slider’s color corresponds to a score of 40 and the right slider to a score of 100. The colors for every score in between are computed using a power scale gradient (with exponent 5).
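The binning logic itself is easy to approximate outside the browser. Here is a rough Python analogue using square bins instead of the D3.js hexbin plugin (the coordinates and scores below are invented, not the real dataset):

```python
import math
from collections import defaultdict
from statistics import median

# Invented (lat, lon, score) triples: one high-scoring cluster, one low.
restaurants = [
    (37.795, -122.407, 98), (37.796, -122.406, 100), (37.794, -122.408, 96),
    (37.794, -122.418, 62), (37.795, -122.417, 70),
]

def bin_key(lat, lon, size=0.01):
    # Square spatial bins; the map uses hexagons, but the idea is the same.
    return (math.floor(lat / size), math.floor(lon / size))

bins = defaultdict(list)
for lat, lon, score in restaurants:
    bins[bin_key(lat, lon)].append(score)

# As on the map: color <- median score of the bin, radius <- restaurant count.
summary = {key: (median(scores), len(scores)) for key, scores in bins.items()}

print(sorted(summary.values()))
```

The two invented clusters land in separate bins: one with a median of 98 across three restaurants, one with a median of 66 across two.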

Motivation

Somewhat recently, Yelp announced that it is partnering with Code for America and the City of San Francisco to develop LIVES, an open data standard which allows municipalities to publish restaurant inspection data in a standardized format. This is a step towards a much more transparent government, leading ultimately to a more engaged citizenry.

To understand what those opaque numbers in restaurant windows mean, I set out to use statistics and data science to better grasp the implications of the ratings.

Process

The entire process has been documented in an IPython notebook here, and I hope anyone who is curious will run the code and review the analyses before taking the results at face value (because no one should trust a data scientist).

Some interesting results and insights I have found can be summed up by the plots below.

In order to learn more about the relative rating of each restaurant and find out just how good a 90 is, I simply plotted all the data in a histogram. It turns out (quite surprisingly) that the majority of restaurants score better than 94 and that 100 is the mode of the dataset. It is actually quite comforting that so many restaurants score so well, but it might make you think twice about eating at your favorite restaurant that happened to score a 90.
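A quick sketch of that histogram-style summary with NumPy; the scores below are invented to mirror the skew described above, not the real dataset:

```python
import numpy as np

# Hypothetical scores, skewed high like the real distribution.
scores = np.array([100, 100, 100, 98, 96, 94, 92, 90, 85, 70])

# Bucket counts over a few coarse score ranges.
counts, edges = np.histogram(scores, bins=[0, 70, 80, 90, 100])

# The most common score (works because scores are small non-negative ints).
mode = int(np.bincount(scores).argmax())
```

With real data you would plot `scores` directly (e.g. `plt.hist(scores, bins=100)`) rather than inspect counts by hand.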

The right plot bins the scores into the categories the city defined to give them a more qualitative interpretation: ‘Poor’ (0-70), ‘Needs Improvement’ (71-85), ‘Adequate’ (86-90), and ‘Good’ (91-100). The interesting thing to note about this quantization is that the scale is very nonlinear: ‘Poor’ spans 70 points while ‘Good’ spans only 10.
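The city’s quantization can be reproduced with pandas’ cut function; a minimal sketch with made-up scores (the break points are the ones listed above):

```python
import pandas as pd

scores = pd.Series([65, 72, 88, 90, 94, 100])  # hypothetical scores
labels = pd.cut(
    scores,
    bins=[0, 70, 85, 90, 100],  # (0,70], (70,85], (85,90], (90,100]
    labels=["Poor", "Needs Improvement", "Adequate", "Good"],
)
```

Note that pd.cut’s intervals are right-inclusive by default, which matches the city’s break points exactly.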

With such a skewed distribution and nonlinear scales, our old way of thinking often does not directly translate. To get a better grasp on the relative scores of restaurants compared to each other (and potentially to other cities), I computed the quantiles of the distribution. Quantiles give us a somewhat standardized ranking for comparing different scales and distributions in a normalized fashion. It is for this reason that summary statistics are powerful tools for inference and a staple of any statistician’s (or data scientist’s) tool belt.
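Computing quantiles is a one-liner with NumPy; a sketch with made-up scores:

```python
import numpy as np

# Hypothetical scores for eight restaurants.
scores = np.array([70, 85, 90, 92, 94, 96, 98, 100])

# Quartiles of the distribution.
q1, median, q3 = np.percentile(scores, [25, 50, 75])

# Where does a 90 sit? The fraction of restaurants scoring below it.
frac_below_90 = (scores < 90).mean()
```

Even on this toy data, a 90 falls below the median, which is the counter-intuitive point the real distribution makes far more dramatically.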

Thanks to these very basic, easy-to-implement analyses, I am now a much more informed citizen, and I realize that scales in general can distort your perception. In school we come to internalize 70 as a passing score, anything better than 90 as quite good, and 98-100 as unheard of… for Berkeley Physics at least ;)

Conclusion

I hope this post showed you that you do not necessarily need to do very complex analyses to get interesting insights and that it inspires folks to get out there and start working with open data. The first step to breaking into data science is to start making, and pick a project that you are passionate about (or always wanted to know the answer to). If you have any questions about restaurant health inspection data, the data science process, or our program and classes please do not hesitate to reach out (or to just say hello!) at jonathan@zipfianacademy.com. Happy Data-ing!

Cheers,

Jonathan

A Practical Intro to Data Science

Are you interested in taking a course with us? Learn about our programs or contact us at hello@zipfianacademy.com.

There are plenty of articles and discussions on the web about what data science is, what qualities define a data scientist, how to nurture them, and how you should position yourself to be a competitive applicant. There are far fewer resources out there about the steps to take in order to obtain the skills necessary to practice this elusive discipline. Here we will provide a collection of freely accessible materials and content to jumpstart your understanding of the theory and tools of Data Science.

At Zipfian Academy, we believe that everyone learns at different paces and in different ways. If you prefer a more structured and intentional learning environment, we run a 12-week immersive bootcamp that trains people to become data scientists through hands-on projects and real-world applications.

We would love to hear your opinions on what qualities make great data scientists, what a data science curriculum should cover, and what skills are most valuable for data scientists to know. 

While the information contained in these resources is a great guide and reference, the best way to become a data scientist is to make, create, and share!

Environment

While the emerging field of data science is not tied to any specific tools, there are certain languages and frameworks that have become the bread and butter for those working in the field. We recommend Python as the programming language of choice for aspiring data scientists due to its general purpose applicability, a gentle (or firm) learning curve, and — perhaps the most compelling reason — the rich ecosystem of resources and libraries actively used by the scientific community.

Development

When learning a new language in a new domain, it helps immensely to have an interactive environment to explore and to receive immediate feedback. IPython provides an interactive REPL which also allows you to integrate a wide variety of frameworks (including R) into your Python programs.

Statistics

It is often said that a data scientist is someone who is better at software engineering than a statistician and better at statistics than any software engineer. As such, statistical inference underpins much of the theory behind data analysis and a solid foundation of statistical methods and probability serves as a stepping stone into the world of data science.

Courses

While R is the de facto standard for performing statistical analysis, it has quite a steep learning curve, and there are other areas of data science for which it is not well suited. To avoid learning a new language for a specific problem domain, we recommend trying to perform the exercises of these courses with Python and its numerous statistical libraries. You will find that much of the functionality of R can be replicated with NumPy, SciPy, matplotlib, and pandas.
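As a small taste of that replication, R’s summary() is essentially one line of pandas (the data here is made up):

```python
import pandas as pd

scores = pd.Series([70, 85, 90, 92, 94, 96, 98, 100])

# Like R's summary(): count, mean, std, min, quartiles, max.
summary = scores.describe()
```

The same object model carries you from here into grouping, joining, and plotting without switching languages.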

Books

Well written books can be a great reference (and supplement) to these courses, and also provide a more independent learning experience. These may be useful if you already have some knowledge of the subject or just need to fill in some gaps in your understanding:

Machine Learning/Algorithms

A solid base of Computer Science and algorithms is essential for an aspiring data scientist. Luckily there is a wealth of great resources online, and machine learning is one of the more lucrative (and advanced) skills of a data scientist.

Courses

Books

Data ingestion and cleaning

One of the most under-appreciated aspects of data science is the cleaning and munging of data that often represents the most significant time sink during analysis. While there is never a silver bullet for such a problem, knowing the right tools, techniques, and approaches can help minimize time spent wrangling data.
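As a flavor of what munging looks like in practice, here is a minimal pandas sketch; the column names and bad values are hypothetical. It parses numeric types, normalizes strings, and discards erroneous rows:

```python
import pandas as pd

# Hypothetical raw scrape with the usual problems: stray whitespace,
# missing names, non-numeric and out-of-range scores.
raw = pd.DataFrame({
    "name": ["  Cafe A", "Cafe B ", None],
    "score": ["95", "not inspected", "101"],
})

df = raw.dropna(subset=["name"]).copy()          # drop rows missing a name
df["name"] = df["name"].str.strip()              # normalize whitespace
df["score"] = pd.to_numeric(df["score"], errors="coerce")  # bad values -> NaN
df = df[df["score"].between(0, 100)]             # keep only valid scores
```

Each step is trivial on its own; the craft is in knowing which checks a given dataset needs, which is why this phase eats so much time.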

Courses

Tutorials

  • Predictive Analytics: Data Preparation: An introduction to the concepts and techniques of sampling data, accounting for erroneous values, and manipulating the data to transform it into acceptable formats.

Tools

  • OpenRefine (formerly Google Refine): A powerful tool for working with messy data, cleaning, transforming, extending it with web services, and linking to databases. Think Excel on steroids.

  • DataWrangler: Stanford research project that provides an interactive tool for data cleaning and transformation.

  • sed: “The ultimate stream editor” — used to process files with regular expressions often used for substitution.

  • awk: “Another cornerstone of UNIX shell programming” — used for processing rows and columns of information.

Visualization

The most insightful data analysis is useless unless you can effectively communicate your results. The art of visualization has a long history, and, while it is one of the most qualitative aspects of data science, its methods and tools are well documented.

Courses

Books

Tutorials

Tools

  • D3.js: Data-Driven Documents — Declarative manipulation of DOM elements with data dependent functions (with Python port).

  • Vega: A visualization grammar built on top of D3 for declarative visualizations in JSON. Released by the dream team at Trifacta, it provides a higher level abstraction than D3 for creating canvas or SVG based graphics.

  • Rickshaw: A charting library built on top of D3 with a focus on interactive time series graphs.

  • modest maps: A lightweight library with a simple interface for working with maps in the browser (with ports to multiple languages).

  • Chart.js: Very simple (only six charts) HTML5 canvas based plotting library with beautiful styling and animation.

Computing at Scale

When you start operating with data at the scale of the web (or greater), the fundamental approach and process of analysis must change. To combat the ever-increasing amount of data, Google developed the MapReduce paradigm. This programming model has become the de facto standard for large scale batch processing since the 2007 release of Apache Hadoop, the open-source MapReduce framework.
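The paradigm itself fits in a few lines of plain Python: map emits key-value pairs, a shuffle groups values by key, and reduce aggregates each group. This word-count sketch is the canonical illustration (the function names are ours, not Hadoop’s API):

```python
from collections import defaultdict

def map_phase(doc):
    """Emit a (word, 1) pair for every word in the document."""
    for word in doc.split():
        yield word, 1

def reduce_phase(word, counts):
    """Aggregate all values emitted for one key."""
    return word, sum(counts)

docs = ["big data big ideas", "data science"]

# "Shuffle": group the mapped values by key.
groups = defaultdict(list)
for doc in docs:
    for word, count in map_phase(doc):
        groups[word].append(count)

# Reduce each group independently (this is what parallelizes at scale).
result = dict(reduce_phase(w, c) for w, c in groups.items())
```

In a real Hadoop job, the map and reduce phases run on different machines and the shuffle happens over the network; the logic, though, is exactly this.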

Courses

Books

Putting it all together

Data Science is an inherently multidisciplinary field that requires a myriad of skills to be a proficient practitioner. The necessary curriculum has not fit into traditional course offerings, but as awareness grows of the need for individuals with such abilities, we are seeing universities and private companies create custom classes.

Courses

Books

Tutorials

Conclusion

Now this just scratches the surface of the infinitely deep field of Data Science, and we encourage everyone to go out and try their hand at some science! We would love for you to join the conversation over @zipfianacademy and let us know if you want to learn more about any of these topics.

Blogs

  • Data Beta: Professor Joe Hellerstein’s blog about education, computing, and data.

  • Dataists: Hilary Mason and Vince Buffalo’s old blog that has a wealth of information and resources about the field and practice of data science.

  • Five Thirty Eight: Nate Silver’s famous NYT blog where he discusses predictive modeling and political forecasts.

  • grep alex: Alex Holmes’s blog about distributed computing and the intricacies of Hadoop.

  • Data Science 101: One man’s personal journey to becoming a data scientist (with plenty of resources).

  • no free hunch: Kaggle’s blog about the practice of data science and its competition highlights.

Resources

If you made it this far, you should check out our 12-week intensive program. Apply here!

Welcome to Zipfian

We enthusiastically welcome anyone and everyone who has a childlike curiosity about not just how the world works, but why the world works to join the Zipfian community!

Won’t you join us on our intellectual journey?

We are embarking on a mission to push the bounds of what is possible with data — exploring the core of what it really means to be a data scientist.

Zipfian Academy is an attempt to make the world a bit more transparent — by arming dedicated individuals with the skills necessary to make sense of the mountain of data before us. We dream of a world in which no claim gets accepted without evidence to support it, no information that could improve the state of our world is sequestered, and the tools and knowledge to ask the right questions are open and freely available.

It is in this spirit that we invite you to join the conversation. Tell us a story, give us your opinion — this is a place to be heard. We look forward to hearing from all of you!

— Jonathan & Ryan


P.S. Stay tuned to our feeds to hear about classes, meetups, and other upcoming events.