When we ask data leaders to share the challenges they’re facing, we rarely hear, “I don’t have enough data.” If anything, most organisations have more data than they know what to do with, and for many, its use is limited to historical analysis. While this is an important part of any data journey, when we move beyond data for diagnostics and consider its ability to help us predict behaviours, it becomes a transformational tool in any organisational kitbag.
In our latest webinar, Transform’s Head of Analytics, Michael Baines, and Senior Data Analyst, Patrick Greenway, explained how machine learning can be used to predict behaviour.
Tackling problems before they arise
To kick off, Michael talked through the four stages of analytical maturity.
Descriptive analysis: Level one is the use of analytics to look back at what has already happened – useful for KPIs, dashboards and business intelligence reporting.
Diagnostic analysis: At the next stage of maturity, data is used to identify trends and carry out diagnostic work to understand root causes.
Predictive analysis (machine learning): Coming into play next is the use of forward-looking techniques to predict how users will interact with products and services based on past behaviours. Michael and Patrick described this as the shift from descriptive analytics (understanding what has happened in the past) to predictive analytics (anticipating what will happen next).
Prescriptive analysis: The final stage enables organisations to build interventions and tackle problems before they arise.
Tools and technologies for building machine learning models
But how do you get to the point where you are using data to predict future states? Patrick and Michael talked through some of the tools and technologies that can be used to build machine learning models.
SAS and IBM’s SPSS may have been the standard platforms in the days of physical servers, but cloud technology has changed everything, and now Microsoft Azure, Amazon Web Services (AWS) and Google Cloud are the platforms of choice. They let you build machine learning models that scale out and up, adding computing power when needed.
In terms of integrated development environments (IDEs), Michael recommends RStudio, which gives you access to R’s libraries for building different algorithms, as well as Spyder, a Python IDE that lets you create different data sets and models and view them in a highly visual, joined-up way.
Databricks is getting lots of traction in the industry, as it combines the functions mentioned above with solid data-engineering foundations. It’s built on a Spark backend, so it’s fast even when you’re running very intensive processes.
Also popular is Jupyter Notebook. You write your code, import your data, create different steps, and view your analysis as you go.
The main languages are Python and R – but that’s not to say others aren’t used. Nowadays there’s more happening in Java and C++, as well as Julia and Go, but to get started, most people will use Python or R.
The essential 3-step process for your data
Michael and Patrick spent some time discussing the process for building a robust data set that you can feed into a machine learning model. It's important to note that the three steps aren’t always linear. Success comes through repetition, and you may go back and change things while exploring the data.
Step 1: Creating your initial data set
What relevant data do you need to gather? What affects the outcome you're predicting? Collaboration is key here – your domain experts will have theories about what's causing the issue you’re exploring. For example, if you’re trying to predict employee turnover, ask HR what factors they think lead to employees leaving the business. This can be a shortcut to identifying the data you need.
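In Python (one of the two languages mentioned above), this first step might look like the following sketch. The employee-turnover table and its columns are invented for illustration; in practice they would come from the factors your domain experts suggest:

```python
import pandas as pd

# Hypothetical employee-turnover data set: each row is an employee,
# each column a factor the domain experts (here, HR) flagged as relevant.
employees = pd.DataFrame({
    "tenure_years":    [0.5, 3.2, 7.1, 1.0, 4.5],
    "salary_band":     [2, 3, 5, 1, 4],
    "commute_minutes": [55, 20, 10, 70, 30],
    "left_company":    [1, 0, 0, 1, 0],   # the outcome we want to predict
})

# Separate the predictors (features) from the outcome (target).
X = employees.drop(columns="left_company")
y = employees["left_company"]
```

Starting from a simple table like this makes it easy to add or remove candidate factors as your theories evolve.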
Step 2: Understanding the data
There are different techniques you can employ here. For example, correlation plots, kernel density plots and histograms all highlight relationships and useful variables that you'll want to include in your model. It’s also helpful to identify what isn't useful. There might be variables that have no relationship to what you're trying to predict, so you’ll want to drop those before you train your model.
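A minimal sketch of that exploration in Python, again using invented employee-turnover data: a correlation matrix gives a quick first look at which variables relate to the outcome, and which are candidates to drop.

```python
import pandas as pd

# Invented data: tenure and commute plausibly relate to leaving;
# shoe size is included as a deliberately irrelevant variable.
df = pd.DataFrame({
    "tenure_years":    [0.5, 3.2, 7.1, 1.0, 4.5, 6.0],
    "commute_minutes": [55, 20, 10, 70, 30, 15],
    "shoe_size":       [8, 9, 8, 9, 8, 9],
    "left_company":    [1, 0, 0, 1, 0, 0],
})

# Correlation of every variable with the outcome we want to predict.
corr = df.corr()["left_company"].drop("left_company")
print(corr.sort_values(ascending=False))

# Variables with near-zero correlation (and no domain rationale)
# are candidates to drop before training the model.
weak = corr[corr.abs() < 0.2].index.tolist()
```

Correlation is only one lens – kernel density plots and histograms, as mentioned above, reveal relationships a single number can miss – but it is a cheap first filter.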
Step 3: Cleaning and preparing the data for the final model
This step removes any unhelpful or incomplete variables.
You might also decide that a variable is useful, but there’s not enough data for the model. For example, you could theorise that pay has an impact on employee turnover but find your salary data too sparse to support it. To add it to the model, go back to step 1 and gather more relevant data.
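A short sketch of the cleaning step in Python, with made-up data: here the mostly-empty salary column is the kind of variable you would either drop or go back and collect properly (the 50% missing-data threshold is an arbitrary choice for illustration).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "tenure_years": [0.5, 3.2, np.nan, 1.0, 4.5],
    "salary_band":  [np.nan, np.nan, np.nan, 1, np.nan],  # mostly missing
    "left_company": [1, 0, 0, 1, 0],
})

# Drop columns that are mostly empty -- not enough data to help the model.
threshold = 0.5
keep = df.columns[df.isna().mean() <= threshold]
clean = df[keep]

# Drop any remaining rows with missing values.
clean = clean.dropna()
```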
Evaluating and improving your model
Patrick and Michael went on to discuss methods of evaluating models through splits.
The train/validation/test split involves dividing your data into three groups. One chunk of data is fed into the model for training and the other two are held back. Because the model never sees the held-back data during training, this gives an unbiased evaluation of its performance.
By running different models on the training data, then evaluating each one against the validation set, you’ll learn which model, variables and parameters give the highest accuracy.
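Using scikit-learn (a common Python choice, not a tool named in the webinar), the split described above can be sketched in two stages; the data here is random and purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # 100 rows of made-up features
y = rng.integers(0, 2, size=100)     # made-up binary outcome

# First carve off the test set (20%), kept untouched until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into training (75%) and validation (25%) sets,
# giving roughly 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
```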
Another method is k-fold cross-validation, which involves splitting the data into k groups (five is common). Each group takes a turn as the held-out validation set while the model trains on the rest, and the results are averaged to evaluate your final model.
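With scikit-learn, five-fold cross-validation is a one-liner; the data set and logistic-regression model here are illustrative stand-ins, not choices from the webinar.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A synthetic classification data set for demonstration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation: the data is split into five groups; each group
# takes a turn as the held-out set while the model trains on the other four.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)

# The mean of the five accuracies is a more stable estimate than one split.
print(scores.mean())
```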
Whichever model you choose, Michael stressed that you shouldn’t fear having to write code from scratch. Most machine learning tasks have now been done before and there’s lots you can take and alter from the internet. So before starting any new model, first look at what’s already out there.
There was so much content in this value-packed webinar that you’ll likely want to watch it yourself if you didn’t get to attend the session. The key takeaways included:
1. An essential framework for building machine learning models
2. Understanding which tools and technologies to use
3. Ways to evaluate your model’s performance
Want to watch a full recording of the webinar or talk to us about your data challenges? Then send us an email at transformation@transformUK.com. We’d love to hear from you.