Research based context¶
Introduction¶
(From Canvas) You apply critical thinking in your day-to-day work. In your planning, you can divide your work into questions which need investigation. Each investigation has a goal which can be validated and is relevant and valuable for your specific context. You use a well-known methodology (for instance the DOT framework) to structure your investigation. The result of the investigation is validated by you and shows the quality and value of the result.
The result of your investigation can be justified and presented by you, both verbally and in writing. Others can validate the results, making the results transferable to others. One of the ongoing investigations is: keeping an eye on the current state of development of your products (for instance using the Technology Readiness Level).
Learning focuses¶
In order to shape the upcoming curriculum, I’ve chosen various learning focuses for Research-based-context. These are a work in progress and still have to be developed further.
Category¶
T = Technical skills
N = Non-technical skills
R = Research & development skills
P = Professional skills
Learning tasks¶
| Task# | Category | Requirement | Status | Description |
|---|---|---|---|---|
| #0 | RPN | Must | Done | Setup main research question |
| #1 | NP | Must | Done | Create sub questions |
| #2 | NR | Must | Done | Ethics & legal literature study |
| #3 | T | Must | Done | Data analytics |
| #4 | T | Must | Done | Prototyping & product review |
| #4 | T | Must | Done | Integrating into FaaS |
| #5 | NP | Must | Done | Conclusion |
Context¶
This assignment was made in the context of the learning goals of semester six of Software Engineering at Fontys Hogescholen. My research centers on Machine Learning, specifically in the domain of Cyber Security & Software Engineering. To answer the research questions, various techniques, including Machine Learning, will be used. The research focuses on implementing a machine learning tool that can determine password strength.
The DOT framework is applied to the research in order to ensure a reliable implementation.
By combining various methods of the framework, it is possible to triangulate the research and ensure a successful implementation.
Main Research Question¶
Task analysis¶
OLD: Is there a correlation between bad passwords & domain names
After receiving feedback from teachers, the main research question changed. These were the feedback points:
This solution would deliver a ‘smart’ password complexity policy (too vague & no use case context)
Design a tool to enable a smart custom password filtering policy upon user registration based on domain names (scope too big)
Design a tool to enable Ramses users upon account creation that validates the complexity of their passwords based on Machine Learning results (good)
NEW: Design a tool to enable Ramses users upon account creation that validates the complexity of their passwords based on Machine Learning results
Sub questions¶
Is it legal to acquire this dataset
Applied method: Literature study
Where to acquire datasets that can be used to train the algorithm:
Applied method: Community Research
Applied method: Available product analysis
What makes a password bad
Applied method: Data analytics
How to apply Machine Learning in this context
Applied method: Product review
Applied method: Prototyping
How to integrate machine learning password project into an Azure FaaS
Applied method: Workshop
Is it legal to acquire the dataset¶
Before this question can be answered, it must first be clear what an illegal dataset looks like.
Literature study¶
The OM (Openbaar Ministerie, the Dutch Public Prosecution Service) has published a list of guidelines relating to cybercrime. A table on this page gives examples of the punishment corresponding to each crime committed. This table assumes that there is only one offender and that it is their first offense.
According to this table, having malware, passwords or login information at your disposal is an offense that can result in a jail sentence of 2 to 3 weeks.
https://maxius.nl/wetboek-van-strafrecht/artikel139d/lid2
b. een computerwachtwoord, toegangscode of daarmee vergelijkbaar gegeven waardoor toegang
kan worden gekregen tot een geautomatiseerd werk of een deel daarvan, vervaardigt verkoopt,
verwerft, invoert, verspreidt of anderszins ter beschikking stelt of voorhanden heeft.
In short, this states that it is illegal to manufacture, sell, acquire, import, distribute or otherwise make available, or to possess, a computer password, access code or comparable data that grants access to an automated system (“geautomatiseerd werk”) or a part thereof.
An example of this law being used in the courtroom:
https://uitspraken.rechtspraak.nl/inziendocument?id=ECLI:NL:RBROT:2020:2395
The defendant was given a jail sentence lasting 180 and a probation term of 3 years. Do note that in this case many more laws were broken.
Since the author does not intend to become liable through this research, dummy data has to be used instead of passwords or tokens that would grant access to such environments. Alternatively, the dataset may not contain log-in information.
Where to acquire datasets that can be used to train the algorithm¶
In order to train a Machine Learning algorithm, a dataset must first be acquired. Since the tool takes a string as input (the password), an implementation of TF-IDF is required. TF-IDF, short for Term Frequency-Inverse Document Frequency, is a way to translate text into numbers through a process called vectorization, which is a fundamental step in machine learning on text. Using TF-IDF, each term in a string is assigned a numerical weight, so that the string as a whole can be scored by a model.
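To make the vectorization step concrete, here is a minimal sketch of character-level TF-IDF in plain Python. Treating each character as a "term" and each password as a "document" is an assumption made for illustration, and the function name `tfidf_vectors` and the sample passwords are hypothetical. Libraries such as scikit-learn use a smoothed variant of the IDF formula, so their numbers will differ from this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(passwords):
    """Return one {character: tf-idf weight} dict per password."""
    n_docs = len(passwords)

    # Document frequency: in how many passwords does each character occur?
    doc_freq = Counter()
    for pw in passwords:
        doc_freq.update(set(pw))

    vectors = []
    for pw in passwords:
        vec = {}
        for char, count in Counter(pw).items():
            tf = count / len(pw)                     # term frequency
            idf = math.log(n_docs / doc_freq[char])  # inverse document frequency
            vec[char] = tf * idf
        vectors.append(vec)
    return vectors


# Hypothetical sample passwords, for illustration only.
sample = ["abc123", "abcdef", "zX9!qP"]
for pw, vec in zip(sample, tfidf_vectors(sample)):
    print(pw, {c: round(w, 3) for c, w in vec.items()})
```

Characters that occur in every password get a weight of zero, while rare characters get a higher weight. That is the property that lets a classifier tell common patterns from unusual ones.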
Community research¶
What makes a password bad¶
Data analytics¶
The user story for this research requires a well-functioning interface. Thus, one of the requirements is to not only determine the password strength as a score, but also to convert this score into something that a user can interpret more easily. The Kaggle password strength classifier dataset marks the strength with numerical values. The interface can in turn translate these numerical values into strings. Here are examples:
| Password | ML-dataset score | Interface conversion |
|---|---|---|
| “ok>bdk” | 0 | Weak password |
| “visi7k1yr” | 1 | Medium password |
| “b4NbTxDEyNgG141J” | 2 | Strong password |
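The conversion in the table can be sketched as a simple lookup. The names `STRENGTH_LABELS` and `describe_strength` are hypothetical, and raising an error for scores outside the table is an assumption about how unknown values should be handled.

```python
# Mapping taken from the table above: numeric dataset score -> user-facing text.
STRENGTH_LABELS = {
    0: "Weak password",
    1: "Medium password",
    2: "Strong password",
}


def describe_strength(score):
    """Translate a numeric strength score into a readable label."""
    try:
        return STRENGTH_LABELS[score]
    except KeyError:
        raise ValueError(f"Unknown strength score: {score!r}") from None


print(describe_strength(1))
```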
How to Apply Machine Learning in this context¶
Available product analysis¶
Before machine learning can be applied, a list of available products needs to be composed and reviewed. This is most effectively done by leveraging open source communities. The search begins on GitHub: https://github.com/search?q=password+strength+machine+learning
There are requirements, however: the tool must be able to quickly scan strings and give them a score based on a training set used in combination with machine learning. It is also important that the product is written in a language that the author understands, can iterate on, and can configure. The following products have been selected by searching the open source community:
| Title | Language | Author/creator | URL |
|---|---|---|---|
| Machine-Learning-based-Password-Strength-Classification | Python | faizann24 | https://github.com/faizann24/Machine-Learning-based-Password-Strength-Classification |
| Password-Strength-Evaluator-using-Machine-Learning | Python, Html, Flask | OmkarPathak | https://github.com/OmkarPathak/Password-Strength-Evaluator-using-Machine-Learning |
| ExoPassword | Jupyter Notebook, Javascript, Html, Python | apratimshukla6 | |
| SmartPass | Python, Html, Javascript | AgnellusX1 | |
Product review¶
In order to determine which tool will be implemented in the project, a list of requirements has been set up and weighed against the pros and cons of each product. To determine this, the codebases have been researched; the results are shown below. From this table, it can be concluded that #2 is the best fit for the project and its integration.
| # | Tool | Language requirement | Sample training dataset | Documentation | Open source |
|---|---|---|---|---|---|
| 1 | Machine-Learning-based-Password-Strength-Classification | Met | Included | None | Yes |
| 2 | Password-Strength-Evaluator-using-Machine-Learning | Met | Included | Setup | Yes |
| 3 | ExoPassword | Not met | Included | Setup | Yes |
| 4 | SmartPass | Met | Not raw | Design approach | Yes |
The conclusion that can be drawn here is that #2, Password-Strength-Evaluator-using-Machine-Learning, met the most requirements. Based on this, a prototype has to be created.
Prototyping¶
In this sub-section I work out a prototype based on the previous research. I first implemented a PoC that checks passwords against the haveibeenpwned.com database.
import hashlib
import sys

import requests

BASE_URL = 'https://api.pwnedpasswords.com/range/'


def request_api_data(query_char):
    """Request all hash suffixes that share the given 5-character SHA-1 prefix."""
    response = requests.get(BASE_URL + query_char, timeout=10)
    if response.status_code != 200:
        raise RuntimeError(f'Error in fetching: {response.status_code}, please check the API and try again.')
    return response


def get_password_leaks_count(hashes, hash_to_check):
    """Return how often the hash suffix occurs in the API response (0 if absent)."""
    for line in hashes.text.splitlines():
        suffix, count = line.split(':')
        if suffix == hash_to_check:
            return int(count)
    return 0


def pwned_api_check(password):
    """Check a password against the API; only the first 5 hash characters are sent."""
    sha1_password = hashlib.sha1(password.encode('utf-8')).hexdigest().upper()
    first_five_chars, tail = sha1_password[:5], sha1_password[5:]
    resp = request_api_data(first_five_chars)
    return get_password_leaks_count(resp, tail)


def main(args):
    for password in args:
        count = pwned_api_check(password)
        if count:
            print(f'{password} has been leaked {count} times. You should probably change your password.')
        else:
            print(f'{password} was not found. Carry on.')
    print('Program Completed!')


if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))

# How to run:
# python checker.py password123 abc123
This could then be run in tandem with the ML tool, if required.
I then implemented Machine-Learning-based-Password-Strength-Classification from GitHub and ran the base Python script.
From the web interface I can test my password against the machine learning algorithm, which is built on a dataset of 50000 passwords. The dataset is formatted like this:
u96qdgcrRhInMn9|2
tXxr2$pnv&&vG'j'8|3
T2kfb|1
tpLvdwcO|0
grdXx|0
PQxfUqt2lj|2
q,,yeuelq|1
zAErCzhqe|1
Xa5Z"v0"|3
c2mnyzjmi3Evedvz3pd|2
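Reading this format can be sketched as follows. The function names `parse_line` and `load_dataset` are hypothetical, and splitting on the last `|` is an assumption, made so that a password which itself contains a `|` character would still be parsed correctly.

```python
def parse_line(line):
    """Split 'password|label' on the LAST '|' so passwords may contain '|'."""
    password, label = line.rstrip("\n").rsplit("|", 1)
    return password, int(label)


def load_dataset(lines):
    """Parse an iterable of lines, skipping blank ones."""
    return [parse_line(line) for line in lines if line.strip()]


# A few of the sample lines shown above.
sample = [
    "u96qdgcrRhInMn9|2",
    "tXxr2$pnv&&vG'j'8|3",
    "T2kfb|1",
    "tpLvdwcO|0",
]
for password, label in load_dataset(sample):
    print(label, password)
```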
And here is the implemented PoC in action:
How to integrate machine learning password project into an Azure FaaS¶
The next step in the research is to integrate the Python ML project into an Azure Function as a Service (FaaS). The full explanation of FaaS can be found in my cloud page learning goal; a short version follows below:
FaaS¶
These days, “serverless” and “functions-as-a-service” (FaaS) have found themselves at the early side of the hype cycle. Some folks have gone so far to say that serverless and functions are the next evolution of microservices, and that you should just skip the whole microservices architecture trend and go right to serverless. Just as you’d expect, this is a bit hyperbolic; by which you shouldn’t be surprised. There’s a lot of exciting, new technology available to us as architects and developers to improve our ability to achieve our business outcomes. What we need is a pragmatic lens through which to judge and apply these new technologies. Although as technology practitioners, it’s our responsibility to keep up with the latest technology, it’s equally our responsibility to know when to apply it in the context of our existing technology and IT departments.
Not covered by this text is my own theory: the largest driving factor for optimization in cloud deployments is cost. Simply explained, a function (running on Azure, for example) doesn’t require a container that is kept running constantly by the cloud. Instead, it wakes up when requested, serves the function over the cloud, and then gracefully exits once the function has finished executing.
FaaS integration¶
The next step is integrating this ML project into a Function as a Service. For more context on why FaaS is important, please visit the Distributed Data learning goal. In order to implement the ML project in a FaaS, the training models & testing data had to be placed in the cloud. To do this in Azure FaaS, I had to move the models to the root folder, where they could be read by the algorithm.
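The stateless request flow described above can be sketched independently of any cloud SDK. This is not the project's actual function. `load_model`, `handle_request`, `fake_predict`, the JSON field names and the use of pickle for the model file are all hypothetical, and wiring the handler to an Azure Functions HTTP trigger is left out. The sketch only illustrates loading a model file once and reusing it across invocations while the instance stays warm.

```python
import json
import pickle
from pathlib import Path

_MODEL_CACHE = {}  # survives between invocations while the instance is warm


def load_model(path):
    """Unpickle the model once and cache it (assumes a pickled model file).

    Only unpickle files you created yourself; pickle can execute code.
    """
    key = str(path)
    if key not in _MODEL_CACHE:
        with Path(path).open("rb") as fh:
            _MODEL_CACHE[key] = pickle.load(fh)
    return _MODEL_CACHE[key]


def handle_request(body, predict):
    """Turn a JSON request body into a JSON response string.

    `predict` is any callable mapping a password string to a numeric score.
    """
    try:
        password = json.loads(body)["password"]
        if not isinstance(password, str):
            raise TypeError("password must be a string")
    except (ValueError, KeyError, TypeError):
        return json.dumps({"error": "expected JSON with a 'password' field"})
    return json.dumps({"score": int(predict(password))})


# Stand-in scorer for the demo (NOT a real model): longer means stronger.
def fake_predict(password):
    return min(len(password) // 6, 2)


print(handle_request('{"password": "b4NbTxDEyNgG141J"}', fake_predict))
```

Keeping the handler a pure function of the request body makes it easy to test locally before uploading it to the cloud.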
After implementing this in the already existing FaaS project, I re-uploaded the FaaS to the cloud and saw the results:
Conclusion¶
Traditionally, websites enforce strict rules about password complexity when a new account is created (e.g. must contain a special character, an upper-case letter, etc.). By using Machine Learning, we can instead learn from a large dataset of predefined passwords that have been tagged by hand. Rather than relying only on traditional rule checks, we feed the algorithm a dataset that includes passwords which are easily crackable despite meeting the requirements of the website.
Function as a Service is the next evolution of the old monolith. It is optimized for cloud usage, and with an appropriate use case (such as a password checker) it is a well-suited solution for a cost-efficient software container.
Advice¶
It is important for a business to look critically at its deployment of various containers and to analyze which important functions, such as password checking, can be repackaged into a FaaS. Furthermore, solely using Machine Learning to check password strength is only half the work. This can be improved with more layers of security, for example a check of how many times a password has appeared in a breach, and a static check that ensures the password is complex enough in case the machine learning algorithm favors a weaker, more easily cracked password.
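The layered approach suggested above could look like the following sketch. All names (`static_check`, `combined_verdict`) and thresholds (minimum length 8, at least three character classes, ML score of at least 1) are hypothetical choices for illustration, not values from the research. The breach count and ML score are passed in as plain numbers so that the example runs without a network connection or a trained model.

```python
import string


def static_check(password, min_length=8, min_classes=3):
    """Rule-based safety net: length plus number of character classes used."""
    classes = [
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(c in string.punctuation for c in password),
    ]
    return len(password) >= min_length and sum(classes) >= min_classes


def combined_verdict(password, ml_score, breach_count):
    """Accept a password only if every layer considers it acceptable."""
    if breach_count > 0:
        return "rejected: found in a known breach"
    if not static_check(password):
        return "rejected: too short or not varied enough"
    if ml_score < 1:
        return "rejected: classified as weak"
    return "accepted"


print(combined_verdict("b4NbTxDEyNgG141J", ml_score=2, breach_count=0))
print(combined_verdict("password123", ml_score=1, breach_count=5))
```

Checking the breach count first means a leaked password is rejected regardless of how strong the other layers consider it.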
References¶
FaaS vs. Microservices. Christian Posta et al. DZone Microservices. https://dzone.com/articles/faas-vs-microservices
Art. 139d lid 2 Sr (Artikel 139d lid 2 Wetboek van Strafrecht). Maxius.nl. https://maxius.nl/wetboek-van-strafrecht/artikel139d/lid2
Richtlijn voor strafvordering cybercrime (2018R001). College van procureurs-generaal, Openbaar Ministerie. Entry into force: 1 February 2018. Staatscourant 2018, Nr. 3271. https://www.om.nl/onderwerpen/beleidsregels/richtlijnen-voor-strafvordering-resultaten/…
Password Strength Classifier Dataset. bansal. Kaggle. https://www.kaggle.com/datasets/bhavikbb/password-strength-classifier-dataset