The New Scientist on the Block: the Data Scientist

By Will Lansing

http://www.innercityscience.org/#!will-lansing/ezbb2

Introduction– Who I really am

I have always loved science, though when I was young it seemed more like magic. Science was dinosaurs roaming the Earth millions of years ago being wiped out by a comet that had been flying around in space for billions of years. The sheer scale of everything in Science amazed me and still does. Science was making “new” discoveries every night. I remember using my grandfather’s (Pap, as we called him) telescope to look at the moon at night and being fascinated by how quickly the moon moved out of the telescope’s view. Later, in school, this fascination turned into amazement when I learned how, more than 300 years ago, Isaac Newton created a formula to precisely (although not as precisely as Einstein) calculate the orbits of the moon and the planets. There was no doubt in my mind back then that I was going to be either an astronaut or some kind of time-traveling dinosaur hunter.

Fast forward a few decades and I am most certainly not an astronaut and I have yet to travel back through time, though I think we can all agree that this blog post would be a lot cooler if I had. Although my love for Science has not changed, I do have a new focus in both Technology and Math. With the incredible rate at which technology is advancing (see Moore’s Law) and the amount of data that is being collected, I find myself in a relatively new and exciting field as a Data Scientist, but more on that later.

Outside of my pursuits in STEM subjects, I am a husband, father of three girls, and the owner of an ever-increasing number of pets. I am an avid reader of presidential history and a second-degree black belt in Ju Jitsu. I truly enjoy learning and plan to apply for courses to start a Master of Applied Statistics this fall.

Materials and Methods– How I got here

The Sciences were always my favorite subjects for as far back as I can remember (except Biology, blah). In high school I was able to expand upon this with classes in Chemistry and Physics, where our teacher allowed us to perform all kinds of experiments that resulted in things catching fire or crashing down to earth (and if we were really lucky, both at the same time). Although my passion had always been with Science, I wasn’t sure what kind of “real” job I could apply it towards, so I left for college and declared as a Political Science major with expectations to go on to Law school. I was only part way through my first semester when I started to seriously doubt my decision to go into Law. Luckily I was able to pick up an Intro to Programming class during that semester, and by the end of it I was ready to make the jump to Computer Science. The major focused on the key principles and practices of computing, and the mathematical and scientific principles that underpin them. There was a lot of math, so, so much math: Calculus 1, 2 and 3, Discrete Math, a year of Statistics and Linear Algebra. I did my best to keep my head above water, but I never understood how most of these classes applied to programming.

After graduation I started working for R+L Shared Services as a Programming Analyst. My primary job at the time was to write reports against databases for the business to use in order to make decisions. The problem in most cases was that the data was either in too many different reports, too old, not accurate, or not understood enough to be used. At that time the IT department was creating reports but most of them were just being ignored by the business.

Dilbert

 http://dilbert.com/

After a year or two of generating reams of unused reports, the IT department and business started to work together to identify critical data needs for the business and establish guidelines for the way that data should be used. I began working with a team to develop a data warehouse, which is a single place to pull data into from all the other source systems in the company. By putting all the data in one place we could speed up reporting and ensure that when a user in Sales ran a report it would show the same numbers as a similar report pulled by a user in Finance. This may sound like a simple process, but it has taken almost 10 years to develop it to where it is now. Each day the data warehouse parses hundreds of millions of rows of data so the business can look at trends in our data and try to determine how the company is performing.

This now brings us to the most exciting and challenging part of my job: we know how the company has been performing, but how will we perform in the future? Enter the newest scientist on the block, the Data Scientist.

Results– What I do now

Past performance does not necessarily predict future results. This is such an important fact that the government requires mutual funds (the people that manage all your parents’ money) to basically write that statement on anything they give to investors. What that really means to me as a Data Scientist is I can’t look at past numbers on how the company was performing in a vacuum. We have to consider many other variables that may have had some effect on the company such as economic indicators, fuel prices, job reports, etc.

Remember all those math classes that I didn’t understand why I needed? Well, I wish I had paid a bit more attention in them. We are constantly applying mathematical principles from many different fields of mathematics, such as regression analysis, derivatives, correlations and many others.

In order to predict what may happen in the future, we build what is called a predictive model. To build the model we must first determine what we are trying to predict and the timeline for the prediction. For example, if we were to try to predict the amount of time it would take a driver to deliver items to customers tomorrow, we would need to know how long it has taken in the past, what the weather conditions will likely be, what day of the week it is (because no one really works on Fridays), how many items the driver has to deliver, how heavy the items are and possibly the time of the year. We would then estimate the time, and after the driver makes his deliveries we would record the actual time and validate how well the model did. We do this by looking for correlations and causations (if you don’t get the joke below, then hopefully I can explain this in a later post) in the data and adjusting our model for any issues or biases that we have.
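As a rough sketch of the estimate, record and validate loop described above, here is a toy version in Python. It is not the model described in this post: it looks at only one variable (the number of items on the truck), fits a straight line with ordinary least squares, and every number in it is invented for illustration. The names fit_line and predict are hypothetical helpers chosen for this example.

```python
# Hypothetical history: (items on the truck, hours the route took).
# These numbers are invented for illustration only.
history = [(10, 4.0), (14, 5.0), (18, 6.0), (22, 7.0), (26, 8.0)]


def fit_line(points):
    """Ordinary least squares fit of hours = intercept + slope * items."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # How items and hours vary together, and how items vary on their own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, items):
    """Estimate the hours a route will take for a given number of items."""
    intercept, slope = model
    return intercept + slope * items


# Build the model from the past, then estimate tomorrow's route.
model = fit_line(history)
estimate = predict(model, 20)

# After the route is driven, record the real time and check the miss.
actual = 6.8  # hypothetical recorded time, also invented
error = actual - estimate
print(f"estimated {estimate:.2f} h, actual {actual:.2f} h, missed by {error:.2f} h")
```

A real model would add the other variables mentioned above (weather, day of the week, weight, time of year) and would be validated against many routes rather than one.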

XKCD

https://xkcd.com/552/

The model that I have described above is relatively simple given the short timelines involved. The models get much more complicated and contain a lot more room for error the further you look into the future. You can think of it like a weather forecast: if the weather channel tells you it is going to rain today, it probably will; if they tell you it will rain 2 weeks from today, you probably don’t need to run out and buy that umbrella quite yet.

So how do we get better at making predictions? We let computers figure all this out for us.

Discussion– Passion for the subject

Machine learning may sound like something made up for the new Star Wars movie, but the idea has been around since the late 1950s. The idea is to teach the computer to learn new things without explicitly programming it. How does this factor into the future of predictive analytics? The idea would be to build a base model that the computer then runs simulations against using other information that it has. The computer would then modify the model based on patterns it recognizes on its own to make a better prediction. Machine learning is already a staple at tech giants like Google and Apple (think map recommendations and Siri), but it is now also becoming commonplace in many businesses, and the possibilities are nearly limitless.
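To make "the computer modifies the model on its own" a little more concrete, here is a minimal sketch using gradient descent, one common way a program can adjust a model from data. It is an illustration under assumptions, not the approach used at any company named in this post: the data is invented, items are counted in tens to keep the numbers small, and train is a hypothetical function name.

```python
# Hypothetical history: (items on the truck, in tens; hours the route took).
# Invented numbers, for illustration only.
history = [(1.0, 4.0), (1.4, 5.0), (1.8, 6.0), (2.2, 7.0), (2.6, 8.0)]


def train(points, learning_rate=0.1, steps=5000):
    """Fit hours = base + per_ten * items by gradient descent on squared error."""
    base, per_ten = 0.0, 0.0  # the starting model knows nothing yet
    n = len(points)
    for _ in range(steps):
        grad_base = 0.0
        grad_per_ten = 0.0
        for items, hours in points:
            # How far off the current model is for this route.
            miss = (base + per_ten * items) - hours
            grad_base += 2 * miss / n
            grad_per_ten += 2 * miss * items / n
        # Nudge both numbers in the direction that shrinks the average miss.
        base -= learning_rate * grad_base
        per_ten -= learning_rate * grad_per_ten
    return base, per_ten


base, per_ten = train(history)
print(f"learned: hours = {base:.2f} + {per_ten:.2f} * (items / 10)")
```

Nobody tells the program what the relationship between items and hours is. It starts from zero and repeatedly corrects itself against the recorded routes until its estimates stop missing.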
