Analysis of Contemporary Trends in Data Science Applications in Cyber Security

Ayush Srivastav
Trends in Data Science
Mar 28, 2021

1. Introduction

Cyber security describes a set of processes, protocols, and techniques designed to protect networks, computers, data, and programs from damage, threats, attacks, and unauthorized access (Aftergood, 2017). It is important in protecting individuals, corporations, and governments against attacks, hacks, and breaches (Greengard, 2016). Pinpointing and blocking threats on a network has become harder due to alias IP addresses and cloaking techniques. These breaches have become a global nuisance, causing financial losses reported at over 500 billion dollars a year as of 2014 (Greengard, 2016).

Researchers in cyber security have moved beyond the conventional cat-and-mouse game and introduced techniques that can recognize attacks, patterns, and malware instantly and adapt the system dynamically (Greengard, 2016). Cybersecurity tools increasingly deploy Machine Learning and Data Science. Classification techniques are used to label the type of attack, detect fraudulent techniques, and find injection attacks. Regression analysis is used to identify unexpected system calls and suspicious HTTP requests, and to compare requested network packet parameters with their typical values. Clustering methodologies are applicable in forensic analysis and malware protection (How To Use Applied Data Science And Machine Learning For Cyber Security, 2020). Data Science has thereby changed how the field of cybersecurity approaches threat assessment, intrusion detection, data protection, and contextual awareness.
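The idea of comparing a requested packet parameter with its typical values can be sketched in a few lines. This is a minimal illustration and not a regression model: it swaps in a plain z-score test, and the function names, the threshold of 3.0, and the payload sizes are all made-up assumptions.

```python
from statistics import mean, stdev

def zscore(value, typical_values):
    """Return how many standard deviations `value` lies from the typical mean."""
    mu = mean(typical_values)
    sigma = stdev(typical_values)
    if sigma == 0:
        # All typical values are identical: any other value is infinitely far away.
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma

def is_suspicious(value, typical_values, threshold=3.0):
    """Flag a value whose absolute z-score exceeds the (assumed) threshold."""
    return abs(zscore(value, typical_values)) > threshold

# Made-up payload sizes in bytes, standing in for "typical" traffic.
typical_sizes = [512, 498, 530, 505, 521, 490, 515, 508]
print(is_suspicious(509, typical_sizes))    # close to the typical values
print(is_suspicious(4096, typical_sizes))   # far outside the typical range
```

A real system would learn the typical values per protocol or per endpoint and would likely use a more robust statistic, since a single extreme sample can distort the mean and standard deviation.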

2. The Traditional Fear, Uncertainty and Doubt Approach for Threat Assessment

Many organizations have relied heavily on the Fear, Uncertainty and Doubt (FUD) approach to cyber security, which is highly unreliable (Maayan, 2020). The FUD approach is based on assumptions about where attacks are likely to strike (Slagell, 2009). Moreover, the nature of attacks and the technology used in them keep evolving. Threats against an organization are often minor in their initial approach (The Big Connect: How Data Science Is Helping Cybersecurity, 2019) and grow in impact at a later stage through privilege escalation. As a result, organizations end up spending huge sums of money on prophecies and exaggerations and are still unable to identify threats (Curry, 2017).

2.1 Opportunities for Data Science-Driven Solutions and Its Impact

Data Science methodologies could greatly improve threat identification. Data Science draws on data collected from various relevant cyber sources (Sarker et al., 2020). This large volume of raw security data can be analyzed to identify patterns and build data-driven models. Machine Learning-based models can predict risks from past exploitation and from the behavioral patterns of users (The Big Connect: How Data Science Is Helping Cybersecurity, 2019) and entities (Maayan, 2020). When deployed, these technologies can find patterns in data even where the outliers are minor, by establishing a correlation between abnormal user behavior and security attacks, which makes threat identification much more robust. The capital now spent on assumptions would be better used on a narrowed-down set of security threats identified by scientific methods.
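As a rough illustration of what establishing a correlation between abnormal user behavior and security attacks can mean in practice, the sketch below computes a Pearson correlation between two daily counts. The function name and both series are hypothetical, and the numbers are invented for the example rather than taken from any dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Made-up daily counts for one week (illustrative only).
abnormal_logins = [2, 1, 0, 7, 9, 1, 0]
confirmed_incidents = [0, 0, 0, 2, 3, 0, 0]
r = pearson(abnormal_logins, confirmed_incidents)
print(f"correlation: {r:.2f}")
```

A coefficient close to 1 would suggest that days with more abnormal logins also tend to see more incidents. It does not by itself show that one causes the other, so it is better read as a signal for prioritizing investigation than as proof.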

2.2 Challenges and Their Mitigation

A major issue in such a data-driven approach is the availability of recent datasets. The datasets available are usually raw and rarely up to date (Sarker et al., 2020). Security threats evolve constantly, so a model trained on older data may fail to predict recent types of threats or to understand current behavioral patterns. The collection of such data may also raise a variety of privacy issues. To mitigate the first problem, organizations need to build and maintain a large number of recent datasets.

Another challenge in developing models is the quality of the datasets. The data is often noisy, imbalanced, incomplete, or insignificant (Sarker et al., 2020), which degrades both the quality of the solution and the performance of the resulting model. These issues can be dealt with by using existing approaches, such as unsupervised or semi-supervised learning, or by proposing new algorithms.
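For the imbalance problem specifically, one widely used remedy is to weight rare classes more heavily during training. This is a different technique from the unsupervised and semi-supervised approaches named above. The sketch below computes such weights with the inverse-frequency heuristic that scikit-learn documents for `class_weight="balanced"`. The function name and the 95/5 split are assumptions made for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Made-up label distribution: attacks are rare compared with benign traffic.
labels = ["benign"] * 95 + ["attack"] * 5
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, misclassifying one of the rare attack samples costs the model far more than misclassifying a benign one, which counteracts the tendency to predict "benign" for everything.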

3. Failure of Existing Intrusion Detection Systems

Traditional Intrusion Detection Systems (IDSs) are not equipped with the right tools to deal with the current cyber landscape. These systems are put in place to monitor policy violations and malicious activities in cyberspace. They make use of traditional tools such as user authentication, cryptography, firewalls, and antivirus software (Sarker et al., 2020). They cannot detect attacks with unidentified signatures, and they defend a single target owing to their point-based nature (Thuraisingham et al., 2016). Such IDSs are of little use against Advanced Persistent Threats (APTs) and low-and-slow attacks, which are the preferred methods for crime and nation-wide attacks (Thuraisingham et al., 2016).

3.1 Opportunities for Data Science-Driven Solutions and Its Impact

Modern Intrusion Detection Systems have become popular because they employ Machine Learning and Deep Learning techniques and learn efficiently from large datasets available in raw form (Ahmad et al., 2021). They make use of stream-based classification and text analysis to identify any ongoing attack as well as the likelihood of forthcoming threats (Thuraisingham et al., 2016). These systems maintain a rich Knowledge Base (KB) by dynamically analyzing large streams of data and information to detect intrusions, new vulnerabilities, and the resources to be protected. Thus, Machine Learning-driven Intrusion Detection Systems are able to detect and mitigate threats even in large organizational structures.
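To give a feel for what processing a stream of events looks like, here is a minimal sketch of a sliding-window detector. It is a fixed threshold rule rather than a learned classifier, so it only illustrates the streaming part of the idea. The class name, window length, limit, and the sample events are all hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque

class SlidingWindowDetector:
    """Flag a source once it exceeds `max_events` within `window_seconds`."""

    def __init__(self, window_seconds=60, max_events=5):
        self.window_seconds = window_seconds
        self.max_events = max_events
        self._events = defaultdict(deque)  # source -> timestamps inside the window

    def observe(self, source, timestamp):
        """Record one event; return True if the source is over the limit."""
        window = self._events[source]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_events

detector = SlidingWindowDetector(window_seconds=60, max_events=3)
# Made-up stream of (source, timestamp-in-seconds) failed-login events.
stream = [("10.0.0.5", t) for t in (0, 10, 20, 30, 40)] + [("10.0.0.9", 45)]
alerts = [(src, ts) for src, ts in stream if detector.observe(src, ts)]
print(alerts)
```

Each event is handled once, as it arrives, and only the timestamps inside the window are kept in memory. A learned stream classifier would replace the fixed limit with a model that is updated as new labeled events come in.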

3.2 Challenges and Their Mitigation

The key issue with any Deep Learning-based framework working on large datasets is computational expense. Deep Learning models consume a large amount of Graphics Processing Unit (GPU) resources and computation time (Ahmad et al., 2021). However, ongoing research in quantum computing and the use of transfer-learning-based models are helping to mitigate these issues.

Another issue is the low performance of such models in real-world environments. Models developed in laboratories using public datasets are usually not tested in a real-world environment (Ahmad et al., 2021). They are therefore unaware of recent trends in cyberspace and perform poorly as a result. Such challenges can be addressed by testing the effectiveness of the model on real-world modern networks.
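One simple way to make the gap between laboratory and real-world performance visible is to compute the same metrics on both kinds of data. The sketch below does this with precision and recall. The function name is hypothetical, and the labels and predictions are invented for illustration, not results from any real model.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means 'attack'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and predictions; not results from any real model.
benchmark_true = [1, 1, 1, 0, 0, 0, 0, 1]
benchmark_pred = [1, 1, 1, 0, 0, 0, 1, 1]
live_true = [1, 1, 1, 0, 0, 0, 0, 1]
live_pred = [1, 0, 0, 0, 1, 0, 1, 0]

print("benchmark:", precision_recall(benchmark_true, benchmark_pred))
print("live:     ", precision_recall(live_true, live_pred))
```

A large drop in recall on the live sample means the model is missing attacks it has not seen in its training data, which is the failure mode described above.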

4. Protection of Valuable Information

Weak protocols pose a threat to the protection of data. Data items are of immense value to an organization, and they may be lost to attackers because of weak security protocols (Maayan, 2020). Leaked or stolen key business information cost the world $600 billion in 2018 (The Big Connect: How Data Science Is Helping Cybersecurity, 2019).

4.1 Opportunities for Data Science-Driven Solutions and Its Impact

Data Science, however, enables advanced methodologies for data protection. With highly complex signatures and encryption technologies, unauthorized probing of data can be blocked (Trikha, 2018). Data Science also makes it possible to set up protocols that are far harder to penetrate by analyzing which sets of data are most often targeted by attackers (Trikha, 2018). Data backup technologies have been automated, and Machine Learning models have been put in place to ensure that the requirements and priorities of security plans are followed, so that the most sensitive data is backed up first (Maayan, 2020). Such automated systems ensure that important and sensitive information is backed up at regular intervals and that access to it is highly secure.

4.2 Challenges and Their Mitigation

Understanding and designing an algorithm for such a specialized task can be a big problem. While security experts use Machine Learning for data protection, attackers may use Artificial Intelligence-based algorithms to flood the model with false negatives, causing analysts to ignore the threats (Foote, 2019). Experts are designing highly robust models capable of analyzing and identifying such anomalies in false negatives and generating an appropriate automated response.

5. The “HOW” of an Attack: Contextual Awareness

The “how” of an attack is as important as the “what”. Traditional vulnerability detectors and antivirus tools have the sole aim of identifying a threat in the system and then removing it from the network. However, the how of an attack, i.e., the factors that can lead to the attack and its characteristics, may help identify and predict the attack before it happens (How To Use Applied Data Science And Machine Learning For Cyber Security, 2020). Systems that are aware of the context of an attack will thus be able to predict forthcoming attacks.

5.1 Opportunities for Data Science-Driven Solutions and Its Impact

Data Science-driven solutions have helped create far better, contextually aware systems for the detection of attacks. Information such as spatial and temporal context, relationships among connections or events, and dependencies may be analyzed to determine whether an activity is suspicious (Sarker et al., 2020). Additional features such as the attacker’s entry point, the data accessed by the attacker, and the attacker’s footprints within the network further increase the likelihood of blocking attackers from the network (How To Use Applied Data Science And Machine Learning For Cyber Security, 2020). Thus, issues that appear minor and may be overlooked by the system, such as Denial of Service (DoS) attacks, may be detected and fully eliminated from the network.
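The way spatial, temporal, and relationship context can feed into a single decision is sketched below. It is a hand-written scoring rule under stated assumptions, not a trained model: the event fields, the usual hours and countries, and the failure limit are all hypothetical.

```python
def context_score(event, usual_hours, usual_countries, max_failures=3):
    """Return (score, reasons): one point per contextual signal that is off."""
    reasons = []
    if event["hour"] not in usual_hours:
        reasons.append("unusual hour")            # temporal context
    if event["country"] not in usual_countries:
        reasons.append("unusual location")        # spatial context
    if event["failed_attempts_before"] > max_failures:
        reasons.append("preceded by failures")    # relationship between events
    return len(reasons), reasons

# Made-up login event and user profile (illustrative only).
event = {"user": "alice", "hour": 3, "country": "XX", "failed_attempts_before": 6}
score, reasons = context_score(
    event, usual_hours=range(8, 19), usual_countries={"IN"}
)
print(score, reasons)
```

Each signal on its own is weak, since people do travel and work late. It is the combination of signals that makes an event worth a closer look. A data-driven system would learn the usual hours and locations per user instead of hard-coding them.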

5.2 Challenges and Their Mitigation

The primary challenge in the deployment of a contextually aware system is data privacy. Each node in such a system has to secure both the information being exchanged and the location of that information (Almutairi, 2012). The information must not be available to the public. Developing such a system is a highly complex task. However, researchers have proposed multiple frameworks for such a system (Almutairi, 2012).

6. Conclusion

“Data Science has become the eyes to cyber security's sword” (Torres, 2017). Cyber security Data Science aims to adopt a scientific approach to the identification of hostile attacks on cyberspace and the entire digital infrastructure. Advanced outlier detection methodologies have made it much easier to identify anomalies, which are key factors in the recognition of an attack. Data Scientists thus provide cyber security professionals with key intelligence that enables them to counter-attack and defend more effectively.

References

Aftergood, S. (2017). Cybersecurity: The cold war online. Nature, 547(7661), 30.

Ahmad, Z., Shahid Khan, A., Wai Shiang, C., Abdullah, J., & Ahmad, F. (2021). Network intrusion detection system: A systematic study of machine learning and deep learning approaches. Transactions on Emerging Telecommunications Technologies, 32(1), 1–29. https://doi.org/10.1002/ett.4150

Almutairi, S. (2012). Review on the Security Related Issues in Context Aware System. International Journal of Wireless & Mobile Networks, 4(3), 195–204. https://doi.org/10.5121/ijwmn.2012.4313

Curry, S. (2017). Cut the FUD: Why Fear, Uncertainty and Doubt is harming the security industry. HelpNet Security. https://www.helpnetsecurity.com/2017/11/29/fud-cybersecurity/

Foote, K. D. (2019). Artificial Intelligence, Machine Learning, and Data Protection. Dataversity. https://www.dataversity.net/artificial-intelligence-machine-learning-and-data-protection/

Greengard, S. (2016). Cybersecurity gets smart. Communications of the ACM, 59(5), 29–31. https://doi.org/10.1145/2898969

How To Use Applied Data Science And Machine Learning For Cyber Security. (2020). Entrust Solutions. https://www.entrustsolutions.com/2020/06/30/applied-data-science-and-machine-learning-for-cyber-security/

Maayan, G. (2020). How Data Science has Changed Cybersecurity. Data Science Dojo. https://blog.datasciencedojo.com/data-science-changed-cybersecurity/

Sarker, I. H., Kayes, A. S. M., Badsha, S., Alqahtani, H., Watters, P., & Ng, A. (2020). Cybersecurity data science: an overview from machine learning perspective. Journal of Big Data, 7(1). https://doi.org/10.1186/s40537-020-00318-5

Slagell, A. (2009). Fear, Uncertainty and Doubt: The Pillars of Justification for Cyber Security. The Amazing Meeting 7. http://www.slagell.info/Adam_J._Slagell/Publications_files/TAM7.pdf

The Big Connect: How Data Science is Helping Cybersecurity. (2019). Infosecurity Magazine. https://www.infosecurity-magazine.com/blogs/data-science-helping-cybersecurity-1/

Thuraisingham, B., Kantarcioglu, M., Hamlen, K., Khan, L., Finin, T., Joshi, A., Oates, T., & Bertino, E. (2016). A data driven approach for the science of cyber security: Challenges and directions. Proceedings — 2016 IEEE 17th International Conference on Information Reuse and Integration, IRI 2016, 1–10. https://doi.org/10.1109/IRI.2016.10

Torres, P. (2017). Data Science for Cyber Security. Medium. https://medium.com/codex/data-science-for-cyber-security-32e2f81e15d3

Trikha, A. (2018). Role of Data Science in Cyber Security. Dataversity. https://www.dataversity.net/role-data-science-cyber-security/
