A solution to a common AI problem

Nov 12, 2024

By Anthony Wynne, SVGC Senior Data Scientist

This news article about Citi Bank is a classic example of what I frequently see and hear about in digital transformation work. Everyone wants to board the AI train; no one wants to clean up their data (or consider the cost of cleaning it up).

I am sure most data scientists have encountered the difficulties of reliably extracting data from Excel reports that have been lovingly created by people who are endlessly creative in organising data, its relationships and visuals.

To the person maintaining the Excel report, changing the name of a column or mixing data types is effortless, but the knock-on effects can be huge for the infrastructure now ingesting that data.
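To make the failure mode concrete, here is a minimal Python sketch of the kind of check an ingestion step might run before trusting a report. It is an illustration only, not a description of any particular company's pipeline: the column names, the `EXPECTED_SCHEMA` contract and the `check_rows` helper are all hypothetical, and the rows are assumed to have already been read out of the spreadsheet into plain dictionaries.

```python
from typing import Any

# Hypothetical contract for one report: column name -> expected Python type.
EXPECTED_SCHEMA: dict[str, type] = {
    "region": str,
    "month": str,
    "revenue": float,
}


def check_rows(rows: list[dict[str, Any]], schema: dict[str, type]) -> list[str]:
    """Return human-readable problems; an empty list means the rows match."""
    problems: list[str] = []
    for i, row in enumerate(rows):
        # A renamed column shows up as one missing name and one unexpected name.
        missing = sorted(schema.keys() - row.keys())
        extra = sorted(row.keys() - schema.keys())
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        if extra:
            problems.append(f"row {i}: unexpected columns {extra}")
        # Mixed data types show up as a value of the wrong type.
        for column, expected in schema.items():
            if column in row and not isinstance(row[column], expected):
                problems.append(
                    f"row {i}: {column!r} is {type(row[column]).__name__}, "
                    f"expected {expected.__name__}"
                )
    return problems


# Made-up example rows: one clean, one with a renamed column, one with text
# typed into a numeric column.
rows = [
    {"region": "North", "month": "2024-10", "revenue": 1200.0},
    {"region": "South", "month": "2024-10", "Revenue (GBP)": "1,150"},
    {"region": "East", "month": "2024-10", "revenue": "n/a"},
]

problems = check_rows(rows, EXPECTED_SCHEMA)
for problem in problems:
    print(problem)
```

A check like this does not fix the report, but it turns a silent downstream failure into a message that the person maintaining the spreadsheet can act on.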

Many companies have tried a top-down approach, using a data warehouse or data lake as the solution. However, this often takes data ownership away from the talented analysts who know the data inside out and care for it.

The large language models that so many companies want to bring into their ecosystem benefited from being trained on data tagged and organised in a standardised way on the internet. One potential solution is for organisations to convert all their data in complex Excel reports into a structure that can be ingested and learnt by a machine learning model. Still, as anybody who has taken over someone’s pet Excel report knows, it will be challenging, time-consuming and expensive to get the true meaning out of the data in an automated way.

So what is the answer? I have seen some excellent solutions that enable the individuals who understand and care for the data to convert it into a structure that AI systems can use.

One exemplar system I have seen work very well is used by the data science website Kaggle, although it was initially built for a different purpose: data science competitions. It provides a good model of how a company can organise its internal data and data analysis. The Kaggle system allows the siloed data curator, who knows their data very well, to share their data sets and get user feedback. This encourages the data curator to structure and format their data in a way that is easy for machine learning models and data analysts alike to ingest. It also creates the motivation to take pride in their dataset.

The platform then provides a system and infrastructure where anybody can analyse that data in notebooks that are attached to the data, leading to cross-company collaboration and spontaneous self-organised teamwork. The data scientist or analyst can also ingest many different data sets for their project. The results of their analysis are stored with the data sets so others can see the previous work, quickly get up to speed and build upon it. Importantly, the notebooks that use the data are version-controlled and repeatable.

In conclusion, if you are considering a digital transformation, I recommend looking at a Kaggle-like system and considering building a similar infrastructure at your company.
