This article was originally published on LinkedIn by our CTO on August 14th, 2018. It has been updated and re-posted here.
In the early days of High-Performance Computing (HPC), the ability to solve complex problems was determined by how much processing power was available to reach a goal within a reasonable amount of time. At the same time, the algorithms that made up HPC applications were designed to take full advantage of that processing capability, with no limits in sight. Moore’s law supported this idea with impressive innovations and ever more powerful CPUs that kept up with demand and kept compute-intensive applications happy.
It is fair to say that most companies considered the source code for their algorithms to be the core Intellectual Property (IP) of their business and protected it accordingly. Patenting an algorithm was an effective way to prohibit somebody else from using the same or a similar methodology. The algorithm was the differentiator between them and the competition. The combination of the algorithm with an HPC environment gave companies a working model. The benefit was the ability to generate repeatable results with a predictable growth path. However, it is very difficult to build past behavior into rule-based algorithms. The main problem is that there is no easy or fast way to learn from the past; analyzing past data takes a lot of manual work. The data-centric approach delivers a more elegant and effective solution.
The open-source community has been the major technology driver behind the Big Data push and, in a sense, has delivered the capabilities for a more data-centric approach: the ability to create a working model that can predict future outcomes based on past events. Continuously adding new data will improve the process. Big Data is characterized by the volume of data that is generated daily, the speed (velocity) at which that data comes in, the variety of data formats, such as structured or unstructured, and the variation in data quality.
Artificial Intelligence (AI) builds on top of HPC for large-scale compute processing and on Big Data with the support of the open-source community.
The real IP now is the data, and the data is the main differentiator between competitors. The algorithms used in AI are software frameworks that are created and shared by many people. The sharing of ideas and concepts is a key component of the growing success of AI.
With the democratization of data, it is now possible for non-data scientists to collect and analyze data with little help and without the need for a science degree. There are numerous guides and tutorials that will get you started with AI. They don’t guarantee success, but they bring AI within greater reach and support the drive towards digital self-service. Today, data is front and center in AI. Consequently, the data used will define success or failure. Having deep knowledge and understanding of the data comes with great responsibility.
Not all data is equal. It is best to start an AI training project with a collection of data that resembles as closely as possible the data domain to be analyzed. In the end, an AI engine will analyze your data and suggest decisions based on the training data it was given. If the data quality is low, the AI engine will make inaccurate predictions. That doesn’t mean that everything must be perfect. Just make sure that the share of bad data is significantly lower than the share of good data.
Let’s say, for example, that we want to create a model to recognize different kinds of cats in pictures. We might unwittingly have been given a mixed data set of pictures of cats (50%) and dogs (50%). The resulting model will have learned features from both dogs and cats, which reduces its accuracy for cats. On the other hand, the presence of a small percentage of dog pictures will not have a major impact on accuracy.
At the start of a new project, the initial data is divided into training and test data. The training data is used by the AI engine to generate a model that is a statistical representation of all the training data it was given. The test data is used to test and validate the new model created with the training data. A typical split is close to 80/20 for training/test data.
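The 80/20 split described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function name `split_train_test`, the `test_fraction` parameter and the fixed `seed` are assumptions made for this example and are not taken from any particular AI framework.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=42):
    """Shuffle the records and split them into (train, test) lists.

    Hypothetical helper for illustration: test_fraction=0.2 gives the
    roughly 80/20 split mentioned in the text.
    """
    shuffled = list(records)
    # Shuffle with a fixed seed so the split is repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled records become test data, the rest training data.
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data: the integers 0..99 stand in for real records.
train, test = split_train_test(range(100))
print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters when the original data is ordered, for example by date, because otherwise the test data would only cover one end of the range.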
Before data can be analyzed and deliver insights, it needs to be “cleaned”. Cleaning means looking for inconsistencies in the data, which can have a major impact on the analysis and its results. Potential sources of bad data include human error, data mismatches when combining multiple sources, and missing or duplicate data.
For example, when collecting the dates of past earthquakes, cleaning includes validating that each date is a valid date and lies in the past. Depending on the processing rules, an invalid date can be corrected, or the whole row of data fields associated with it can be thrown out. Cleaning is followed by the “formatting” and labeling of the data, which makes sure that the data is in the right format for the AI tools to process. This requires significant human interaction and time to go through all the data, especially for unstructured data. The good news is that new tools are being brought to market that can not only help automate the process but also reduce human error.
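The date check from the earthquake example could look roughly like the sketch below. It assumes that each row is a dictionary with a `date` field in ISO format (YYYY-MM-DD) and that the processing rule is to throw out invalid rows rather than correct them; the function name, the field names and the sample rows are all made up for illustration.

```python
from datetime import date


def clean_event_dates(rows, today=None):
    """Keep rows whose 'date' is a valid calendar date that is not in the future.

    Hypothetical helper: returns (kept, rejected), where rejected holds
    (row, reason) pairs so that dropped rows can be reviewed later.
    """
    today = today or date.today()
    kept, rejected = [], []
    for row in rows:
        try:
            parsed = date.fromisoformat(row["date"])
        except (KeyError, TypeError, ValueError):
            # Missing field, non-string value, or impossible date such as Feb 30.
            rejected.append((row, "invalid or missing date"))
            continue
        if parsed > today:
            rejected.append((row, "date is in the future"))
            continue
        kept.append({**row, "date": parsed})
    return kept, rejected


# Made-up placeholder rows, not real earthquake records.
rows = [
    {"id": 1, "date": "2001-05-17"},
    {"id": 2, "date": "2018-02-30"},  # not a valid calendar date
    {"id": 3, "date": "2999-01-01"},  # lies in the future
    {"id": 4, "date": None},          # missing value
]
kept, rejected = clean_event_dates(rows, today=date(2018, 8, 14))
print(len(kept), len(rejected))  # 1 3
```

Keeping the rejected rows together with a reason, instead of silently dropping them, makes it easier to decide later whether a correction rule would be a better fit than removal.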
These AI-driven tools analyze the structure of the data and then predict the kinds of errors they expect to see in it. In a typical AI lifecycle, a lot of time is spent on this activity. The process has to be repeated as new data is continuously added.
In summary, it is all about the data! We live in a data-centric world where data has become extremely valuable. The ability to extract value out of that data with AI has opened many new opportunities. Thankfully, most of them come with good intentions. With AI, it is important not to fixate too much on getting as close as possible to 99.9% accuracy. Instead, it is better to focus on improving the overall process and acquiring quality data. It all starts with the data, as the algorithm can’t fix bad data. At best, it can try to minimize its impact. Data promises to deliver deep knowledge and understanding, and it comes with a lot of responsibility.