As discussed in our Data Science: An Untapped Resource within the Public Sector blog, data science is the bedrock for innovation and vital for growth and development within the public sector. However, the terms used can be confusing and hard to understand.
In this post we’ve provided a brief introduction to some of the most popular terms used within data science, to help you understand the different approaches and techniques used to extract vital insight from data. If you need any further guidance on understanding data science terms, get in touch with our data scientists today.
Correlation Analysis
Correlation analysis studies the relationship between a variable of interest and an explanatory variable. For example, the variable of interest could be consumption, while the explanatory variable could be GDP (Gross Domestic Product) per capita. If the relationship proves to be statistically significant, the explanatory variable is said to be associated with the variable of interest. Statistics such as r-squared and the p-value are used to assess the strength of the relationship. At SVGC, we use Correlation Analysis to identify sensitivities in everything from documents to programme risks; an example of this can be seen in our work with FCDO Services.
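As an illustration, the strength of such a relationship can be computed directly. The sketch below uses invented GDP-per-capita and consumption figures (not real data) and plain Python to compute Pearson's r and the r-squared value:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical figures: GDP per capita (explanatory) vs consumption (variable of interest)
gdp = [20, 25, 30, 35, 40, 45]
consumption = [15, 18, 23, 26, 30, 33]

r = pearson_r(gdp, consumption)
print(f"r = {r:.3f}, r-squared = {r * r:.3f}")
```

An r-squared close to 1 indicates that the explanatory variable accounts for almost all of the variation in the variable of interest; whether the association is statistically significant would then be judged from the p-value.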
(Multivariable) Regression Analysis
Regression analysis is a more general form of correlation analysis, where the relationships between one variable of interest and several explanatory variables are measured. For example, the variable of interest could be consumption while the explanatory variables could be GDP per capita, commodity prices, new product launches and so on. Regression analysis helps to understand how changes in explanatory variables affect the variable of interest. It is widely used for predictions and forecasts.
Multivariate regression analysis studies the relationships between several variables of interest and several explanatory variables. For example, the variables of interest could be consumption of various items, while the explanatory variables could be GDP per capita, commodity prices, new product launches, population demographics and so on. Multivariate regression analysis helps to understand how changes in the explanatory variables affect each variable of interest.
At SVGC, we use Regression Analysis for our work on Net Assessment of international defence and security capabilities.
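A minimal sketch of how a multivariable regression can be fitted by ordinary least squares, using invented consumption figures and two hypothetical explanatory variables (GDP per capita and a commodity price index). Solving the normal equations directly, as done here, is the textbook approach; in practice a statistics library would be used:

```python
def fit_ols(X, y):
    """Ordinary least squares: solve (X'X) b = X'y by Gaussian elimination.
    X is a list of rows of explanatory values; an intercept column of 1s is prepended."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Build the normal equations A b = c
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for i in range(k):
        p = max(range(i, k), key=lambda m: abs(A[m][i]))
        A[i], A[p] = A[p], A[i]
        c[i], c[p] = c[p], c[i]
        for m in range(i + 1, k):
            f = A[m][i] / A[i][i]
            for j in range(i, k):
                A[m][j] -= f * A[i][j]
            c[m] -= f * c[i]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

# Hypothetical data: consumption explained by GDP per capita and a commodity price index
X = [(20, 5), (25, 4), (30, 6), (35, 5), (40, 3), (45, 4)]
consumption = [14, 19, 21, 26, 33, 35]
intercept, b_gdp, b_price = fit_ols(X, consumption)
print(f"consumption = {intercept:.2f} + {b_gdp:.2f}*GDP + {b_price:.2f}*price")
```

The fitted coefficients quantify how a unit change in each explanatory variable affects the variable of interest, which is what makes regression useful for prediction and forecasting.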
Supply Chain Inventory Modelling
Supply chain inventory modelling is a method of data analysis used to quantify the impact of item characteristics on item consumption. In simple terms, it assigns weights to the different factors that affect item consumption. The weights can be determined using, for example, multivariable regression modelling.
Decision Tree Analysis
Decision analysis is a general name for techniques that analyse every possible outcome of a decision. A decision tree is a diagram that visualises the outcomes and can be easily interpreted. Decision trees help to understand and evaluate risks and uncertainties, and can answer questions such as: Which factors affect the consumption of an item the most? Can we predict an outcome having made a change? At SVGC, we use Decision Tree Analysis for our work on Net Assessment of international defence and security capabilities.
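To make the idea concrete, a small decision tree can be represented and evaluated in a few lines of code. The tree below is purely illustrative (the questions, thresholds and outcomes are invented, not a real model):

```python
# A decision tree as nested tuples: (question, yes_branch, no_branch); leaves are outcomes.
tree = (
    lambda item: item["usage"] > 100,            # is the item heavily used?
    (lambda item: item["environment"] == "desert",  # if so, is it used in the desert?
     "high demand", "medium demand"),
    "low demand",
)

def decide(node, item):
    """Walk the tree, answering each question, until a leaf (an outcome string) is reached."""
    if isinstance(node, str):
        return node
    question, yes, no = node
    return decide(yes if question(item) else no, item)

print(decide(tree, {"usage": 150, "environment": "desert"}))    # high demand
print(decide(tree, {"usage": 40, "environment": "temperate"}))  # low demand
```

Reading a path from the root to a leaf shows exactly which factors led to each outcome, which is why decision trees are so easy to interpret.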
CHAID Analysis (Chi Squared Automatic Interaction Detector)
CHAID is a type of decision tree algorithm that determines relationships between the variable of interest (for example, the number of demands for a particular item) and the independent variables (for example, certain characteristics: environment, installation vehicle and usage). CHAID automatically creates the decision tree based on the trends and patterns within the data. It can then help to predict the outcome of a change, and is often used for item segmentation.
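At the heart of CHAID is the chi-squared test of independence, which scores each candidate split. A minimal sketch of computing that statistic for one hypothetical split (demand level by operating environment, with invented counts):

```python
def chi_squared(table):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total  # expected count if independent
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical split of item demand (high/low) by environment (desert/temperate)
#                 high  low
observed = [[30, 10],   # desert
            [12, 28]]   # temperate
stat = chi_squared(observed)
print(f"chi-squared = {stat:.2f}")
```

A large statistic (relative to the chi-squared distribution) means demand depends strongly on that characteristic, so CHAID would pick it as a split; the full algorithm repeats this test for every candidate variable at every node.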
Cluster Analysis
Cluster analysis is an exploratory data analysis method that helps identify meaningful structures within data. It defines areas/groups/segments of data that share similarities across several measures. In the marketing industry, cluster analysis is often used to identify item segments. CHAID is also often used for item segmentation, but it is a very different algorithm from cluster analysis: cluster analysis treats all the variables in the data uniformly, while CHAID distinguishes the variable of interest from the independent variables and treats them differently. At SVGC, we use Cluster Analysis to help us to structure unstructured data for Big Data projects.
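A minimal sketch of one common clustering algorithm, k-means, on invented two-dimensional item data (the segment interpretation is illustrative only):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means on 2-D points: assign each point to the nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two hypothetical item segments, e.g. low-demand/low-cost vs high-demand/high-cost
points = [(1, 1), (1.5, 2), (2, 1.2), (8, 8), (8.5, 9), (9, 8.2)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

Note that nothing tells the algorithm what the groups mean: it simply finds points that are similar across both measures, which is exactly the "uniform treatment of variables" that distinguishes cluster analysis from CHAID.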
Bayesian Modelling
There are two different statistical approaches to gaining insights from data: frequentist (or classical) and Bayesian. The frequentist approach builds a model based only on the data observed, while the Bayesian approach allows some subjective beliefs about the model to be incorporated with the observations. At SVGC, we use Bayesian modelling to optimise Operational Management for our clients by predicting the best allocation of tasks, enhanced by local knowledge. Evidence of this work can be seen in our work with the DNO.
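The difference between the two approaches can be shown with a small worked example. The sketch below uses a Beta prior to encode hypothetical "local knowledge" about a task-success rate and combines it with invented observations; this is a generic Beta-Binomial update, not any particular operational model:

```python
# Frequentist estimate uses only the observed data; Bayesian combines it with a prior.
successes, trials = 7, 10            # hypothetical observed task completions
freq_estimate = successes / trials   # frequentist estimate: 0.7

# Beta(a, b) prior encoding a subjective belief that the rate is around 0.5
a_prior, b_prior = 5, 5
a_post = a_prior + successes                  # conjugate update: add successes
b_post = b_prior + (trials - successes)       # ...and failures
bayes_estimate = a_post / (a_post + b_post)   # posterior mean
print(freq_estimate, round(bayes_estimate, 3))  # 0.7 0.6
```

The posterior mean (0.6) sits between the prior belief (0.5) and the observed rate (0.7); as more data arrives, the data increasingly dominates the prior.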
Prediction Interval / Confidence Interval
A confidence interval is a range of values that is likely to contain the unknown value of a variable. A prediction interval is a type of confidence interval that applies to values yet to be observed. For example, suppose the level of demand is the variable of interest, with a forecast of 100. If we know from experience that actual demand falls within plus or minus 15 of the forecast 95% of the time, then we would say that we are 95% confident that demand will be between 85 and 115.
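Such an interval can be estimated from historical data. The sketch below derives an approximate 95% prediction interval from an invented history of forecast errors, assuming the errors are roughly normally distributed:

```python
import statistics

# Hypothetical history of forecast errors (actual demand minus forecast demand)
errors = [-12, 8, 3, -5, 14, -9, 6, -2, 10, -7]
forecast = 100

mean_err = statistics.mean(errors)
sd = statistics.stdev(errors)
# Approximate 95% prediction interval: mean error plus/minus 1.96 standard deviations
low = forecast + mean_err - 1.96 * sd
high = forecast + mean_err + 1.96 * sd
print(f"95% prediction interval: [{low:.1f}, {high:.1f}]")
```

With small samples a t-multiplier would be used instead of 1.96, widening the interval to reflect the extra uncertainty.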
Machine Learning
Machine learning is a form of Artificial Intelligence (AI) and a method of data analysis that iteratively ‘learns’ from data as it arrives, without human intervention. Machine learning can analyse large amounts of data quickly to enable smarter decisions in real time and deliver insights into complex behaviours. An extension of machine learning is Deep Learning, which layers many machine learning models to build more advanced computer models. At SVGC, we use Machine Learning for a variety of tasks including our Digital Sensitivity Review projects with FCDO Services.
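The "learns iteratively as data arrives" idea can be illustrated with online learning, where the model is updated one observation at a time rather than being refit from scratch. A minimal sketch with an invented data stream:

```python
def online_update(w, x, y, lr=0.01):
    """One stochastic-gradient step for the model y = w * x under squared error."""
    pred = w * x
    return w + lr * (y - pred) * x

w = 0.0  # the model starts knowing nothing
# Hypothetical data stream whose true slope is roughly 2
stream = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)] * 50
for x, y in stream:
    w = online_update(w, x, y)   # the model improves with every arriving observation
print(round(w, 2))               # close to 2
```

Each update nudges the weight toward the value that best explains the data seen so far, so predictions can be made at any point without waiting for a batch retraining run.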
Big Data
Big Data is commonly defined by the three V’s: Volume, Velocity and Variety. The following datasets could be considered Big Data: vehicle usage data at the point of use, item demand patterns across all held inventory, or social media data. These types of data can help deliver insights that allow businesses to react to their issues in real time (e.g. data strategies, supply chain adjustments). Big Data requires new technologies, such as Hadoop and Spark, for storage and processing. At SVGC, we use Big Data analysis techniques for challenges including topic modelling and identification of similarity. Evidence of this work can be seen in our work with FCDO Services.