Research based context

Introduction

(From Canvas) You apply critical thinking in your day-to-day work. In your planning, you can divide your work into questions which need investigation. Each investigation has a goal which can be validated and is relevant and valuable for your specific context. You use a well-known methodology (for instance the DOT framework) to structure your investigation. The result of the investigation is validated by you and shows the quality and value of the result.

The result of your investigation can be justified and presented by you, both verbally and in writing. Others can validate the results, making the results transferable to others. One of the ongoing investigations is: keeping an eye on the current state of development of your products (for instance using the Technology Readiness Level).

Learning focuses

In order to shape the upcoming curriculum, I’ve chosen various learning focuses for Research-based-context. These are a work in progress and will be developed further.


Category

In the tables below, the Category column indicates the type of skill involved in each task.
In addition to the standard tasks, I’ve added a custom table with tasks I defined myself.
  • T = Technical skills

  • N = Non-technical skills

  • R = Research & development skills

  • P = Professional skills

Learning tasks

| Task# | Category | Requirement | Status | Description |
|-------|----------|-------------|--------|-------------|
| #0 | RPN | Must | Done | Set up main research question |
| #1 | NP | Must | Done | Create sub questions |
| #2 | NR | Must | Done | Ethics & legal literature study |
| #3 | T | Must | Done | Data analytics |
| #4 | T | Must | Done | Prototyping & product review |
| #5 | T | Must | Done | Integrating into FaaS |
| #6 | NP | Must | Done | Conclusion |

Context

This assignment was made in the context of the various learning goals of semester six of Software Engineering at Fontys Hogescholen. My research is based around Machine Learning, specifically in the domain of Cyber Security & Software Engineering. To answer the research questions, various techniques including Machine Learning will be used. The research focuses on implementing a machine learning tool that can determine password strength.

The DOT framework is applied to the research in order to ensure a reliable implementation.

https://i.imgur.com/2X9y3Xl.png

By combining various methods of the framework, it is possible to triangulate the research and ensure a successful implementation.

Main Research Question

Task analysis

https://ictresearchmethods.nl/images/thumb/d/d4/Logo-field.png/50px-Logo-field.png

OLD: Is there a correlation between bad passwords & domain names

After receiving feedback from teachers, the main research question was changed. These were the feedback points:

  • This solution would deliver a ‘smart’ password complexity policy (too vague & no use case context)

  • Design a tool to enable a smart custom password filtering policy upon user registration based on domain names (scope too big)

  • Design a tool to enable Ramses users upon account creation that validates the complexity of their passwords based on Machine Learning results (good)

NEW: Design a tool to enable Ramses users upon account creation that validates the complexity of their passwords based on Machine Learning results

Sub questions

  • Is it legal to acquire this dataset

Applied method: Literature study

  • Where to acquire datasets that can be used to train the algorithm:

Applied method: Community Research

Applied method: Available product analysis

  • What makes a password bad

Applied method: Data analytics

  • How to apply Machine Learning in this context

Applied method: Product review

Applied method: Prototyping

  • How to integrate machine learning password project into an Azure FaaS

Applied method: Workshop

Where to acquire datasets that can be used to train the algorithm

In order to train a Machine Learning algorithm, a dataset must first be acquired. Since the research result depends on a string input (the password), an implementation of TF-IDF is required. TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting scheme used to translate text into numbers, a process called vectorization, which is a fundamental step in machine learning on text. With TF-IDF, each term in a string (for a password, for example each character) receives a numerical weight, so the string as a whole can essentially be given a score.
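To make the scoring idea concrete, here is a minimal sketch of character-level TF-IDF in plain Python. It assumes that each password is treated as a "document" and each character as a "term". The function names (`char_tokens`, `tfidf_vectors`) and the smoothed IDF formula are illustrative assumptions, not the implementation used by any of the projects reviewed later.

```python
import math
from collections import Counter


def char_tokens(password):
    """Split a password into single-character terms."""
    return list(password)


def tfidf_vectors(passwords):
    """Return one {character: weight} dict per password.

    tf  = occurrences of the character / length of the password
    idf = ln((1 + N) / (1 + df)) + 1, a smoothed variant (assumption)
    """
    n_docs = len(passwords)
    # Document frequency: in how many passwords does each character occur?
    doc_freq = Counter()
    for pw in passwords:
        doc_freq.update(set(char_tokens(pw)))

    vectors = []
    for pw in passwords:
        counts = Counter(char_tokens(pw))
        total = sum(counts.values())
        vec = {}
        for ch, count in counts.items():
            tf = count / total
            idf = math.log((1 + n_docs) / (1 + doc_freq[ch])) + 1
            vec[ch] = tf * idf
        vectors.append(vec)
    return vectors


if __name__ == "__main__":
    sample = ["password", "pass1234", "b4NbTxDE"]
    for pw, vec in zip(sample, tfidf_vectors(sample)):
        # Characters that are rare across the sample get a higher weight.
        top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:3]
        print(pw, [(ch, round(w, 3)) for ch, w in top])
```

The resulting weight vectors are what a classifier would be trained on; a real project would more likely use an existing vectorizer from a machine learning library than hand-written code like this.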

Community research

https://ictresearchmethods.nl/images/thumb/8/87/Logo-library.png/50px-Logo-library.png

What makes a password bad

Data analytics

https://ictresearchmethods.nl/images/thumb/a/ac/Logo-lab.png/50px-Logo-lab.png

The user story for this research requires a well-functioning interface. One of the requirements is therefore that the password strength is not only expressed as a score, but that this score is also converted into something a user can interpret more easily. The Kaggle password strength classifier dataset marks the strength in numerical values. The interface can in turn translate these numerical values into strings. Here are examples:

Examples of conversion

| Password | ML-dataset score | Interface conversion |
|----------|------------------|----------------------|
| “ok>bdk” | 0 | Weak password |
| “visi7k1yr” | 1 | Medium password |
| “b4NbTxDEyNgG141J” | 2 | Strong password |
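The conversion above can be sketched as a simple lookup. The names `STRENGTH_LABELS` and `score_to_label` and the fallback text for unlisted scores are assumptions made for illustration; only the three score-to-label pairs come from the table.

```python
# Labels follow the conversion table above; scores outside 0-2 are not
# covered by the table, so they map to a neutral fallback (assumption).
STRENGTH_LABELS = {
    0: "Weak password",
    1: "Medium password",
    2: "Strong password",
}


def score_to_label(score):
    """Translate a numeric strength score into interface text."""
    return STRENGTH_LABELS.get(score, "Unknown strength")


if __name__ == "__main__":
    for score in (0, 1, 2):
        print(score, "->", score_to_label(score))
```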

How to Apply Machine Learning in this context

Available product analysis

https://ictresearchmethods.nl/images/thumb/8/87/Logo-library.png/50px-Logo-library.png

Before machine learning can be applied, a list of available products needs to be composed and then reviewed. The most effective way to do this is to leverage open source communities and draw up a list of products. The search begins on GitHub: https://github.com/search?q=password+strength+machine+learning

https://i.imgur.com/ppyd2k1.png

There are requirements, however: the tool must be able to quickly scan strings and give them a score based on a training set used in combination with machine learning. It is also important that the product is written in a language that the author understands, can iterate on, and can configure. Possible products were selected by searching the open source community:

Available products

| Title | Language | Author/creator | URL |
|-------|----------|----------------|-----|
| Machine-Learning-based-Password-Strength-Classification | Python | faizann24 | https://github.com/faizann24/Machine-Learning-based-Password-Strength-Classification |
| Password-Strength-Evaluator-using-Machine-Learning | Python, HTML, Flask | OmkarPathak | https://github.com/OmkarPathak/Password-Strength-Evaluator-using-Machine-Learning |
| ExoPassword | Jupyter Notebook, JavaScript, HTML, Python | apratimshukla6 | https://github.com/apratimshukla6/ExoPassword |
| SmartPass | Python, HTML, JavaScript | AgnellusX1 | https://github.com/AgnellusX1/SmartPass |

Product review

https://ictresearchmethods.nl/images/thumb/2/22/Logo-showroom.png/50px-Logo-showroom.png

To determine which tool will be implemented in the project, a list of requirements has been set up and weighed against the pros and cons of each product. From this table, it can be concluded that tool #2 is the best fit for the project and its integration. To determine this, the codebases were studied; the results are shown below.

Product review

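| # | Tool | Language requirement | Sample training dataset | Documentation | Open source |
|---|------|----------------------|-------------------------|---------------|-------------|
| 1 | Machine-Learning-based-Password-Strength-Classification | Met | Included | None | Yes |
| 2 | Password-Strength-Evaluator-using-Machine-Learning | Met | Included | Setup | Yes |
| 3 | ExoPassword | Not met | Included | Setup | Yes |
| 4 | SmartPass | Met | Not raw | Design approach | Yes |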

The conclusion that can be drawn here is that #2 Password-Strength-Evaluator-using-Machine-Learning met the most requirements, thus based on this a prototype has to be created.

Prototyping

https://ictresearchmethods.nl/images/thumb/e/ea/Logo-workshop.png/50px-Logo-workshop.png

In this sub-section I will be working out a prototype based on previous research. I first implemented a PoC that runs a checker against the haveibeenpwned.com database.

import hashlib
import sys

import requests

base_url = 'https://api.pwnedpasswords.com/range/'


# Request all hash suffixes that share the given five-character SHA-1 prefix
def request_api_data(query_char):
    response = requests.get(base_url + query_char, timeout=10)
    if response.status_code != 200:
        raise RuntimeError(f'Error in fetching: {response.status_code}, please check the API and try again.')
    return response


# Read the response: each line has the form SUFFIX:COUNT
def get_password_leaks_count(hashes, hash_to_check):
    hashes = (line.split(':') for line in hashes.text.splitlines())
    for h, c in hashes:
        if h == hash_to_check:
            return int(c)
    return 0


# Hash the password locally and send only the first five characters of the hash
def pwned_api_check(password):
    sha1_password = hashlib.sha1(password.encode('utf-8')).hexdigest().upper()
    first_five_chars, tail = sha1_password[:5], sha1_password[5:]

    # Call the API and look for the remaining hash characters in the response
    resp = request_api_data(first_five_chars)
    return get_password_leaks_count(resp, tail)


def main(args):
    for password in args:
        count = pwned_api_check(password)
        if count:
            print(f'{password} has been leaked {count} times. You should probably change your password.')
        else:
            print(f'{password} was not found. Carry on.')
    print('Program completed!')


if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))

# How to run:
# python checker.py password123 abc123
This could then be run in tandem with the ML tool, if required.

https://i.imgur.com/J3igiQH.png

I then implemented Machine-Learning-based-Password-Strength-Classification from GitHub and ran the base Python script.

https://i.imgur.com/NUzZqeC.png

From the web interface I can test my password against the machine learning algorithm, which is built on a dataset of 50,000 passwords. The dataset is formatted like this:

u96qdgcrRhInMn9|2
tXxr2$pnv&&vG'j'8|3
T2kfb|1
tpLvdwcO|0
grdXx|0
PQxfUqt2lj|2
q,,yeuelq|1
zAErCzhqe|1
Xa5Z"v0"|3
c2mnyzjmi3Evedvz3pd|2

And here is the implemented PoC in action:

https://i.imgur.com/86UOPXm.png
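The `password|label` format of the dataset sample shown earlier can be read with a few lines of Python. This is a minimal sketch: the function names `parse_line` and `load_dataset` are hypothetical, and splitting on the last `|` is an assumption made so that a password which itself contains a `|` is still parsed correctly.

```python
def parse_line(line):
    """Split one 'password|label' line into (password, label).

    rsplit is used so that a '|' inside the password itself does not
    break parsing; only the last '|' separates the label.
    """
    password, label = line.rstrip("\n").rsplit("|", 1)
    return password, int(label)


def load_dataset(lines):
    """Parse an iterable of lines, skipping blank or malformed ones."""
    rows = []
    for line in lines:
        if not line.strip():
            continue
        try:
            rows.append(parse_line(line))
        except ValueError:
            # No '|' present or the label is not an integer: skip.
            continue
    return rows


if __name__ == "__main__":
    sample = ["T2kfb|1", "tpLvdwcO|0", "q,,yeuelq|1", 'Xa5Z"v0"|3']
    for password, label in load_dataset(sample):
        print(repr(password), label)
```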

How to integrate machine learning password project into an Azure FaaS

The next step in the research is to integrate the Python ML project into an Azure Function as a Service (FaaS). A full explanation of FaaS can be found in my cloud page learning goal; a short version follows below:

FaaS

https://ictresearchmethods.nl/images/thumb/e/ea/Logo-workshop.png/50px-Logo-workshop.png https://ictresearchmethods.nl/images/thumb/8/87/Logo-library.png/50px-Logo-library.png

These days, “serverless” and “functions-as-a-service” (FaaS) have found themselves at the early side of the hype cycle. Some folks have gone so far to say that serverless and functions are the next evolution of microservices, and that you should just skip the whole microservices architecture trend and go right to serverless. Just as you’d expect, this is a bit hyperbolic; by which you shouldn’t be surprised. There’s a lot of exciting, new technology available to us as architects and developers to improve our ability to achieve our business outcomes. What we need is a pragmatic lens through which to judge and apply these new technologies. Although as technology practitioners, it’s our responsibility to keep up with the latest technology, it’s equally our responsibility to know when to apply it in the context of our existing technology and IT departments.

Source

Not covered by this text is my own theory: the largest driving factor for optimization in cloud deployments is cost. Simply put, a function (running on Azure, for example) does not require a container that is kept running constantly by the cloud. When requested, it wakes up, serves the function over the cloud, and gracefully exits once execution is done.

FaaS integration

The next step is integrating this ML project into a Function as a Service. For more context on why FaaS is important, please visit the Distributed Data learning goal. To implement the ML project in a FaaS, the training models and testing data had to be placed in the cloud. To do this in Azure FaaS, I had to move the models to the root folder, where they could be read by the algorithm.

After adding this to the already existing FaaS project, I re-uploaded the FaaS to the cloud and saw the results:
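Reading a model file from the function's root folder could look roughly like the sketch below. It assumes the model was saved with `pickle`, which the text does not confirm; the names `load_artifact` and `get_artifact` and any file name passed to them are hypothetical. The small cache reflects the idea that a warm function instance should not re-read the file on every request.

```python
import pickle
from pathlib import Path


def load_artifact(root_dir, filename):
    """Load a pickled object stored directly under root_dir."""
    path = Path(root_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Model file not found: {path}")
    with path.open("rb") as fh:
        # Only unpickle files you created yourself; pickle can run code.
        return pickle.load(fh)


_cache = {}


def get_artifact(root_dir, filename):
    """Load once per process so warm invocations skip the disk read."""
    key = (str(root_dir), filename)
    if key not in _cache:
        _cache[key] = load_artifact(root_dir, filename)
    return _cache[key]
```

Inside a function handler, `root_dir` would typically be the folder that contains the handler script itself.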

https://i.imgur.com/SNOTztT.png https://i.imgur.com/Z3uOnbJ.png

Conclusion

Traditionally, when a new account is created, websites enforce strict rules about the complexity of the password (e.g. it must contain a special character, an upper-case letter, etc.). With Machine Learning we can instead learn from a large dataset of predefined passwords that are tagged by hand. Rather than relying on traditional means of checking a password, we feed the algorithm a dataset that includes passwords that are easy to crack despite meeting the requirements of the website.

Function as a Service is the next evolution away from the old monolith. It is optimized for cloud usage, and with an appropriate use case (such as a password checker) it is a cost-efficient alternative to a permanently running software container.

Advice

It is important for a business to look critically at its deployed containers and analyze which important functions, such as password checking, can be repackaged into a FaaS. Furthermore, using Machine Learning alone to check password strength is only half the work. It can be improved with more layers of security: for example, a check of how many times a password has been breached, and a static check that ensures the password is complex enough in case the machine learning algorithm favors a weaker and more easily cracked password.
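The layered approach described above can be sketched as follows. The thresholds (minimum length of 8, at least three character classes, a classifier score of at least 1) and the function names are illustrative assumptions, not values taken from this research; the classifier score and breach count are passed in as plain numbers so the sketch stays independent of any particular model or API.

```python
import string


def static_complexity_ok(password, min_length=8):
    """Static rule: minimum length plus at least three character classes.

    The threshold values are illustrative assumptions.
    """
    classes = [
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(c in string.punctuation for c in password),
    ]
    return len(password) >= min_length and sum(classes) >= 3


def accept_password(password, ml_score, breach_count):
    """Combine the three layers; every layer can veto the password.

    ml_score:     0 (weak), 1 (medium) or 2 (strong) from the classifier
    breach_count: number of times the password appeared in known breaches
    """
    if breach_count > 0:
        return False, "Password appears in known breaches"
    if not static_complexity_ok(password):
        return False, "Password does not meet the static complexity rule"
    if ml_score < 1:
        return False, "Classifier rated the password as weak"
    return True, "Password accepted"
```

Because each layer can reject on its own, a password that the classifier rates as strong is still refused when it has been breached or fails the static rule.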

References

FaaS vs. Microservices. Christian Posta et al. DZone Microservices. https://dzone.com/articles/faas-vs-microservices

Artikel 139d lid 2 Wetboek van Strafrecht (Art. 139d lid 2 Sr). Maxius.nl. https://maxius.nl/wetboek-van-strafrecht/artikel139d/lid2

Richtlijn voor strafvordering cybercrime (2018R001). College van procureurs-generaal, OM. In force since 1 February 2018; Staatscourant 2018, Nr. 3271. https://www.om.nl/onderwerpen/beleidsregels/richtlijnen-voor-strafvordering-resultaten/

Password Strength Classifier Dataset. bansal. Kaggle. https://www.kaggle.com/datasets/bhavikbb/password-strength-classifier-dataset