Orchestrating a Data Heist using Machine Learning

Rhett Greenhagen
Aug 19, 2020

Artificial Intelligence (AI), Machine Learning and Data Science are major buzzwords that have surged in popularity lately. Businesses and governments increasingly use them to “learn” more about their targets, i.e., people. The Facebook-Cambridge Analytica data scandal that came to light in 2018 led to serious conversations about how data had effectively become a commodity. Looked at closely, it also showed the public how they had been “socially engineered” into giving up their information.

Social Engineering, paired with AI, could be used to mount damaging attacks on machines and force them to give up any information they hold. The combination works well because it isn’t strictly a “technical” attack. It involves exploiting the human psyche and then using the extracted information with the right technology to launch attacks. To picture this, think of the popular heist film Ocean’s Eleven. In the film, the crew collects as much information as it can on its targets and uses the right people (resources) to steal exactly what it set out to steal.

So, What Exactly is Machine Learning?

“Codifying a strict set of instructions to solve a problem VS Trying to raise a child that will want to do the thing you want it to do — and they will subvert your intentions at every opportunity, trying to shortcut everything to find the simplest through line”

-Rhett Greenhagen

Although the terms are often used interchangeably, Machine Learning (ML) is actually a subset of AI. The purpose of Machine Learning is to predict outcomes from data. If one feeds a data set into an ML algorithm, the resulting model can be used to predict the outcome values associated with new data of the same kind. Applications of ML include detecting fraud, evaluating business processes and filtering spam. Each of these applications requires an ML model to be built for it first. These models, once trained, keep learning from new data relevant to their application, and they require a certain level of trust on the developer’s end. The quote above is apt in this regard: techniques like ML debugging help one understand how a model actually works, which gives the humans operating it a sense of security.

Machine Learning as a Service

Building upon the idea of an ML algorithm, ML has seen renewed interest from businesses, which saw an opportunity to use ML services to predict the outcomes of their decisions. This in turn pushed top “dogs” like Microsoft and Amazon to build their own ML services. The top ML service providers are:

1. Amazon’s AWS (Amazon Web Services)

2. Microsoft’s Azure

3. IBM’s Watson

4. Google’s Cloud Machine Learning Engine

The question that arises here is: how does it work? Without being too technical, an ML model is trained using an algorithm. The algorithm is fed input data, which forms the basis of the ML model. The input data has to contain the expected outcome (target); once the model is deployed, it applies the patterns it learned to new inputs to produce the targeted outcome. Machine Learning as a Service exposes the model through prediction APIs and, together with a data set, works in a “Black Box” style configuration: users keep engaging with the APIs without ever seeing the model’s internals.
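As a rough illustration of the train-then-predict flow described above, the sketch below fits a one-variable linear model with ordinary least squares, using only the Python standard library. The function names (`train`, `predict`) and the toy data are hypothetical and chosen purely for illustration; the models behind real ML services are far richer than this.

```python
from statistics import mean

def train(inputs, targets):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(inputs), mean(targets)
    # Covariance of x and y divided by the variance of x gives the slope.
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(inputs, targets))
    var = sum((x - x_bar) ** 2 for x in inputs)
    slope = cov / var
    intercept = y_bar - slope * x_bar
    return slope, intercept

def predict(model, x):
    """Apply the learned pattern to an input the model has not seen."""
    slope, intercept = model
    return slope * x + intercept

# Toy training data: each input is paired with its expected outcome (target).
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [3.0, 5.0, 7.0, 9.0]  # hypothetical data following y = 2x + 1

model = train(inputs, targets)
prediction = predict(model, 5.0)  # x = 5 was not in the training data
```

In a “Black Box” setup, only something like `predict` would be reachable through the API; the fitted `model` values would stay hidden from the caller.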

Model Stealing Attacks

While these models may seem sophisticated, they are also susceptible to attacks. Fully developed ML models can be exploited and made to spit out facts that even their owners couldn’t fathom. An ML model, also referred to as a “Black Box” by its creators, can be compromised by techniques known as “Blackbox queries”. Major applications of this attack pertain to stealing stock market prediction models and to developing a spam filtering model (a model that filters out mail classified as spam).

According to Rhett Greenhagen, there are attack models which can be used to steal stock market prediction models. The stolen models can then be used to influence stock market prices and extract millions of dollars from the companies that developed them. Greenhagen touched upon investors’ obsession with influencing stock market prices and provided details of a model built for that purpose.

Raw data in the hands of a competent ML connoisseur can inflict more damage than in the hands of a regular stock trader. It can be used to extract patterns, and it goes to show how defenseless these models are against outside probing.
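To make the idea of black-box queries more concrete, here is a minimal sketch of how a model could be reconstructed purely from its answers. It rests on strong assumptions that are not from the talk: the hidden model is linear, the API returns raw numeric outputs, and the attacker can submit arbitrary inputs. The names `make_black_box` and `steal_linear_model`, and the secret weights, are hypothetical.

```python
def make_black_box(weights, bias):
    """Stand-in for a remote prediction API; callers never see weights or bias."""
    def query(features):
        return bias + sum(w * x for w, x in zip(weights, features))
    return query

def steal_linear_model(query, n_features):
    """Recover a linear model using n_features + 1 black-box queries."""
    # Querying the all-zeros input reveals the bias term.
    bias = query([0.0] * n_features)
    weights = []
    for i in range(n_features):
        # A unit vector isolates one weight: output = bias + weight_i.
        probe = [0.0] * n_features
        probe[i] = 1.0
        weights.append(query(probe) - bias)
    return weights, bias

# Hypothetical secret model owned by the victim.
victim_api = make_black_box(weights=[0.5, -2.0, 4.0], bias=1.0)

stolen_weights, stolen_bias = steal_linear_model(victim_api, n_features=3)
clone = make_black_box(stolen_weights, stolen_bias)
```

Under these assumptions the clone answers the same as the victim on any input, and the victim only ever saw four ordinary-looking queries. Real prediction services are rarely this simple, so treat this as an illustration of the principle rather than a working attack.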

How Is It Done?

To understand how any of this works, one needs to be comfortable with computers, networks and how they interact. With that out of the way, turn your attention back to stealing the stock market model. To carry out the theft, reconstruction attacks can be mounted by probing public or private APIs in order to simulate a stock market prediction model. Taking the stock market data of, say, the last 15 years gives one an idea of the patterns associated with it. An ML model also allows you to create your own rules.

Such rules go a long way in alerting you whenever a movement takes place in the stock market. So, if the value of a stock changes by a set percentage, the model will alert its creator. The creator can also set an action to take place once the stock changes by that percentage. Actions in this context are the buying and selling of the stock. Anyone using this stock market prediction model can therefore invest a relatively small amount and walk out with potentially triple the amount in a matter of minutes. The ML model used by the trading houses is basically compromised and “stolen” without them realizing it. This model also plays on the psychology of those deciding to buy and sell, because of the limited windows afforded to them while trading.

To summarize in technical terms, you start by gathering your variables (dependent and independent). You then train the ML model on the available data. Once this is done, the ML model can be either published or used as the developer sees fit.
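The percentage-based alert and action described above might look something like the following sketch. The threshold value, the function names and the mapping of price moves to “buy” and “sell” are all assumptions made for illustration; the source does not specify which move triggers which action.

```python
def percent_change(old_price, new_price):
    """Percentage move from old_price to new_price."""
    return (new_price - old_price) / old_price * 100.0

def decide(old_price, new_price, threshold_pct):
    """Return 'buy', 'sell' or 'hold' based on a percentage-move rule."""
    change = percent_change(old_price, new_price)
    if change >= threshold_pct:
        return "sell"  # assumed rule: price rose past the threshold
    if change <= -threshold_pct:
        return "buy"   # assumed rule: price fell past the threshold
    return "hold"      # move too small to trigger the alert

# Hypothetical prices checked against a 5% threshold.
actions = [decide(100.0, price, threshold_pct=5.0) for price in (106.0, 94.0, 102.0)]
```

In practice the old and new prices would come from a market data feed and from the predictions of the (stolen) model, neither of which is modeled here.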

Another application of this is possible with respect to Google Translate. By understanding the predictive nature of Google Translate, a model stealing attack can be used to source information from the service and develop another service of a similar nature. A Tesla is a prime example of a potential target as well, given its heavy reliance on AI.

So, what can you learn from this example?

Data is malleable. It can be manipulated, and that raises the question of whether one should take the data presented to them at face value. But in a trading setting, one can’t really be this cautious. The Securities and Exchange Commission (SEC) has safeguards in place for this exact purpose, which makes it reasonable to trust this data.

But think of the implications this attack has on the economy. It harms the economy for as long as these stolen models stay undetected. If this became the trend, it would rob the economy and investors of millions of dollars. Not only that, it also strikes at the Intellectual Property that these trading houses and companies have spent a huge chunk of resources developing. When someone tampers with these models, it is also hard to prove the attack in court, so the investment can be destroyed in an instant.

How do you Protect Yourself?

Upon reading this, you might be tempted to build your own model. But there is a good chance that it will backfire miserably. ML models are complex, and questionable data can easily throw one off, so it takes a lot of skill to work with them. While these attacks may not affect the majority of people directly, the threat still looms large over them, especially with the rise of streaming services. Streaming services use ML models, and given people’s obsession with Netflix and its competitors, their users are exposed to attacks too.

Companies have also taken notice of this and have invested a huge chunk of capital into developing competent systems. This shows in the form of extra layers of security, regular pen-testing of their services and encouraging their users to follow good security practices. Asking the right questions about technology will help one steer clear of danger.

Adopting an Anomaly Detection Algorithm

Adopting an Anomaly Detection Algorithm is an ideal first line of defense. As the name suggests, it detects anomalies in an ML model’s behavior; new attacks create new anomalies over the model’s life cycle, and these are what the algorithm picks up. Its validity has also been demonstrated in the gambling industry.
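One simple way to picture anomaly detection in this setting is to watch how often each client queries a prediction API and flag anyone far above the historical norm. The sketch below uses a z-score rule; the choice of signal (query volume), the threshold of 3 and all names and numbers are assumptions for illustration, not a method described in the talk.

```python
from statistics import mean, stdev

def find_anomalies(baseline_counts, current_counts, z_threshold=3.0):
    """Flag clients whose query count is far above the historical baseline."""
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)  # assumes the baseline is not constant
    flagged = []
    for client, count in current_counts.items():
        # z-score: how many standard deviations above normal this client is.
        z = (count - mu) / sigma
        if z > z_threshold:
            flagged.append(client)
    return flagged

# Hypothetical daily query counts observed during normal operation.
baseline = [98, 102, 100, 97, 103, 101, 99, 100]
# Hypothetical counts for today, keyed by client identifier.
today = {"client_a": 104, "client_b": 5000, "client_c": 96}

suspicious = find_anomalies(baseline, today)
```

A patient attacker could spread queries over time or across accounts to stay under such a threshold, so a rule like this is a first line of defense rather than a complete one.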

Develop an Incident Response Process

Once you have detected an anomaly in an ML algorithm, the first thing that tells you how to react is an incident response process. You need to have one in place for your own security. It tells you how to respond when there is a sudden attack, and it is the best strategy for minimizing the loss.

Regularly Evaluate Safety Models

Revisiting the stock market example, it is not an easy attack to carry out. It takes a lot of effort and skill and is often pulled off by corporations. These groups have the motivation and the capital to make it happen. In essence, it is a form of insider trading. These groups can reap huge profits from a single stock, and with entire stock market models in their hands, they can wreak havoc on the economy and investors.

Since these attacks can also be used to steal traditional models developed by companies, the companies lose control over their Intellectual Property. They have basically lost ownership of their model and their products and services. Therefore, these companies need to regularly evaluate their models to avoid a loss of such magnitude. An unorthodox idea in this regard might be to work with someone who has carried out such attacks. This will give them insight into the trends of these attacks and prepare them for the worst. This is consistent with what Rhett Greenhagen stated in his keynote, where he revealed that he had worked with the NSA and the CIA.

Final Thoughts

“As far as we know, this type of attack is not being carried out in the real world by malicious parties right now,” says Anish Athalye, a researcher at MIT. “But given all the research in this area, it seems that many machine learning systems are very fragile, and I wouldn’t be surprised if real-world systems are vulnerable to this kind of attack.”

The world of AI is a complex one. While it has many applications that can be used for the good of society, it has a dark side as well. The average user may not be interested in understanding how AI actually works, but they should take an interest. Technology evolves every day, and it generates data every second. Claiming ownership of data gives firms like Facebook power, as they earn their revenue by monetizing that data for advertisers. What advertisers can then do with a user’s data is a scary thought in itself.

What’s also worth noting here is that the field of AI is itself evolving, which makes it a double-edged sword. On one side, AI could become the catalyst that helps thwart malicious activities. On the flip side, it could result in more cyber-attacks. Pair it with attacks like social engineering or DDoS, and there is enough cause for caution. AI certainly won’t go away. Its benefits far outweigh the capital needed to work with it. It’s just a matter of being aware of what’s going on in this field, irrespective of whether someone follows technology or not.

Resources:

https://www.usenix.org/sites/default/files/conference/protected-files/security16_slides_tramer.pdf

https://elie.net/blog/ai/attacks-against-machine-learning-an-overview/
