Improve articulation and add cover photo

Harry Stuart 2024-06-12 01:28:34 +10:00
parent 38ff7c8a74
commit 9710e0a44a
3 changed files with 24 additions and 22 deletions

@ -2,6 +2,10 @@
This repository contains a productionised version (including testing and benchmarking) of the Kohonen Self Organising Map model implemented in the problem specification. Please let me know if there are any issues running the code.
The below image represents the final weights of a 1000*1000*3 Kohonen Network trained for 1000 iterations. It took 1 hour and 20 minutes to generate on my machine.
![example](documents/1000.png)
## Setting up this repository
Firstly, due to time constraints, there are components of this codebase that are not abstracted in such a way that they are conducive to collaborative work. There are certain hardcoded elements - particularly in the testing and benchmarking modules. I have chosen what I believe to be reasonable defaults for running testing and benchmarking on an average machine.
@ -21,25 +25,24 @@ Run benchmarks using `pytest benchmarks`. Running benchmarks will create MLFlow
## Mantel Code Assessment
Whilst the tests, benchmarking and organisation of my Kohonen Network implementation should intimate a more *production-ready* codebase, I will highlight some key improvements I have made to the implementation.
In the following sections, I have highlighted some key differences between my implementation and Sam's implementation. I will not explore the intricacies of my approach here - but I hope that they are easily observed after looking through the code.
### Using ASCII characters for variable names
The components of a Kohonen Network can all be expressed using mathematical notation, so it seems logical to use these same mathematical symbols in one's code. However, I would suggest two key reasons to Sam why using non-ASCII characters is usually undesirable.
- Developers reading and maintaining the code might not be familiar with the mathematical formulae and would therefore be unable to make semantic sense of such variables.
- Non-ASCII characters are not typically found on standard keyboards, making typing such variable names inconvenient.
The mechanics of a Kohonen Network can all be expressed using mathematical notation, so it may seem logical to use these same mathematical symbols in one's code. However, I would suggest two key reasons to Sam why using non-ASCII characters is usually undesirable.
- Developers reading and maintaining the code might not be familiar with the mathematical formulae and therefore, would be unable to make sense of variable semantics.
- Non-ASCII characters are not typically found on standard keyboards, making typing such variable names inconvenient.
Thus, I always use ASCII characters and make a careful effort to utilise descriptive and unambiguous variable names rather than short and forgettable variable names.
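To make the naming point concrete, here is a minimal sketch. The function name, its parameters, and the Gaussian form of the neighbourhood influence are assumptions chosen for illustration (the Gaussian is a common formulation for Kohonen Networks); none of it is copied from either implementation.

```python
import math


def neighbourhood_influence(squared_distance: float, neighbourhood_radius: float) -> float:
    """Gaussian influence of the best matching unit on a node.

    `squared_distance` is the squared grid distance between the node and the
    best matching unit (hypothetical parameter names, for illustration only).
    """
    return math.exp(-squared_distance / (2 * neighbourhood_radius ** 2))


# The symbol-style equivalent is valid Python 3, but it is awkward to type
# and to search for, and it assumes the reader knows the formula:
#   θ = math.exp(-d2 / (2 * σ ** 2))
influence_at_bmu = neighbourhood_influence(squared_distance=0.0, neighbourhood_radius=5.0)
```

A reader who has never seen the underlying formula can still tell from the descriptive version what each quantity represents.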
### Packaging
Sam's implementation expects to be executed as a Python module due to the `if __name__ == '__main__':` block. Whilst this is okay during first-pass development, I would ask Sam how he would advise others to use his module in this way, and it would soon come to light that there is no CLI argument parsing or other method by which someone could use the module on their own data.
Sam's implementation expects to be executed as a Python module due to the `if __name__ == '__main__':` block. Whilst this is okay during first-pass development, I would ask Sam how he would advise others to use his module in this way, and it would soon come to light that there is no CLI argument parsing or other method by which someone could run the module on their own data.
I would suggest that Sam package his code in a way similar to how I have done - where I've created a `models` package that includes a `kohonen_network.py` module and directly exposes the `train_kohonen_network` function, allowing anyone to easily `from models import train_kohonen_network` to train their own model on their own data.
### Modularisation
Sam's `train` function is dense and does not make use of any helper functions. For a complex algorithm such as training a Kohonen Network, there are several downsides to having a monolithic function:
Sam's `train` function is dense and does not make use of any helper functions. For a complex algorithm such as training a Kohonen Network, there are several downsides to having a monolithic function:
- Reduced readability
- More difficult to make localised changes
@ -47,11 +50,11 @@ Sam's `train` function is dense and does not make use of any helper functions. F
- Cannot test individual components
- Cannot reuse code
I would advise Sam to consider the example of initialising the model weights. While my `initialise_random_tensor` might have a near-identical implementation to his, I have extended it to support any arbitrary dimensionality and have encapsulated the `numpy` implementation details. This means that `initialise_random_tensor` could be used in any other model that needs to randomly initialise a tensor and the `numpy` implementation could, for example, be swapped out with another implementation such as `jax.numpy` for GPU acceleration, without having to manually update each of those functions.
I would ask Sam to consider the example of initialising the model's weights. While my `initialise_random_tensor` might have a near-identical implementation to his, I have extended it to support arbitrary dimensionality and have encapsulated the `numpy` implementation details. This means that `initialise_random_tensor` can be used in any other pipeline that needs to randomly initialise a tensor and the `numpy` implementation could, for example, be swapped out with another implementation, such as `jax.numpy` for GPU acceleration, without having to manually update each of those functions.
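As a rough sketch of what such a helper could look like: the name `initialise_random_tensor` comes from the text above, but the signature, the `seed` parameter and the use of `numpy.random.default_rng` are assumptions for illustration, not the code in this repository.

```python
from typing import Tuple

import numpy as np
from numpy.typing import NDArray


def initialise_random_tensor(shape: Tuple[int, ...], seed: int = 0) -> NDArray[np.float32]:
    """Return a tensor of the given shape with values drawn uniformly from [0, 1).

    Callers only depend on this signature, so the numpy calls below could be
    replaced by another backend without touching any call site.
    """
    rng = np.random.default_rng(seed)
    return rng.random(shape, dtype=np.float32)


# Works for any dimensionality, e.g. a 10 x 10 grid of 3-dimensional weights.
weights = initialise_random_tensor((10, 10, 3))
```

Because the shape is passed as a tuple, the same helper serves a one-dimensional vector or a higher-dimensional grid equally well.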
### Typing and comments
Code should tell a story to the reader and be as clear and simple as possible to follow. I would suggest to Sam that he would have better control over the story he is trying to tell if he used type hinting and comments. However, these should only be used if they *add signal* to the code. Commenting `# Add numbers A and B` above `return A + B` does not add any signal - it dilutes the current signal.
Code should communicate a story to the reader and a good story should flow and be descriptive. I would suggest to Sam that he would have more control over the story he is trying to convey if he used type hinting and comments. However, these should only be used if they *add signal* to the code. Commenting `# Add numbers A and B` above `return A + B` does not add any signal - it dilutes the current signal.
Let's consider my function:
@ -65,7 +68,7 @@ def _find_best_matching_unit(weights: NDArray[np.float32], x: NDArray[np.float32
return Node(*jnp.unravel_index(bmu, (height, width)))
```
By abstracting this implementation into a helper function, I get the same modularisation benefits as discussed above because someone following a story that uses `_find_best_matching_unit` only needs to understand what it does to continue the story, not how it does what it does. I have done a few things to communicate *what this function does* as concisely as possible.
By abstracting this logic into a helper function, I get the same modularisation benefits as discussed above because someone following a story that uses `_find_best_matching_unit` only needs to understand what it does to continue the story, not how it does what it does. I have done a few things to communicate *what this function does* as clearly and concisely as possible.
- Including argument type hints provides important context for the reader/user to ascertain what the author expects the function to operate on.
- Declaring a named tuple return type, `Node`, informs the reader what the function returns and is more descriptive than simply `Tuple[int, int]`.
@ -73,18 +76,17 @@ By abstracting this implementation into a helper function, I get the same modula
### Performance
Sam's implementation could benefit considerably from some vectorised operations instead of iterating over each node and updating its weight. Vectorised operations in `numpy` are implemented to leverage highly efficient, low-level code that can utilise hardware acceleration - often resulting in large speed-ups.
Sam's implementation could benefit from vectorised operations in place of iterating over nodes to compute weight updates. Vectorised operations in `numpy` are implemented to leverage highly efficient, low-level code that can utilise hardware acceleration - often resulting in large speed-ups.
In my implementation, I abstracted out `_update_weights` so that I could wrap it in [JAX](https://github.com/google/jax) Just-In-Time compilation to compile the function using XLA (Accelerated Linear Algebra). The below image shows the output of a benchmark comparing my implementation to Sam's implementation for random parameters and inputs. In this benchmark, JAX was configured to use my CPU - one could configure the module to use JAX on a GPU for even greater speed gains. While JAX does add some overhead and may be less efficient for very simple networks, it is orders of magnitude faster for complex networks.
In my implementation, I abstracted out `_update_weights` so that I could wrap it in [JAX](https://github.com/google/jax) Just-In-Time compilation to compile the function using XLA (Accelerated Linear Algebra). The below image shows the output of a benchmark comparing my implementation to Sam's implementation for random parameters and inputs. In this benchmark, JAX was configured to use my CPU - one could configure the module to use JAX on a GPU for even greater performance boosts. While JAX does add some overhead and may be marginally less efficient for very simple networks, it is orders of magnitude faster for complex networks.
![performance](documents/execution_time_comparison.png)
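The difference between the node-by-node and vectorised styles discussed in this section can be sketched in plain `numpy`. Both functions below are illustrative assumptions: they apply a standard Gaussian-neighbourhood update and are neither Sam's `train` function nor the `_update_weights` function in this repository.

```python
import numpy as np


def update_weights_loop(weights, x, bmu, radius, learning_rate):
    """Reference version: visit every node and update it individually."""
    updated = weights.copy()
    height, width, _ = weights.shape
    for row in range(height):
        for col in range(width):
            squared_distance = (row - bmu[0]) ** 2 + (col - bmu[1]) ** 2
            influence = np.exp(-squared_distance / (2 * radius ** 2))
            updated[row, col] += learning_rate * influence * (x - weights[row, col])
    return updated


def update_weights_vectorised(weights, x, bmu, radius, learning_rate):
    """Same update computed for all nodes at once using broadcasting."""
    height, width, _ = weights.shape
    rows, cols = np.indices((height, width))
    squared_distance = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-squared_distance / (2 * radius ** 2))
    # influence has shape (height, width); add a trailing axis so that it
    # broadcasts against the (height, width, features) weight tensor.
    return weights + learning_rate * influence[..., np.newaxis] * (x - weights)


rng = np.random.default_rng(0)
weights = rng.random((8, 6, 3))
x = rng.random(3)
loop_result = update_weights_loop(weights, x, (2, 3), 2.0, 0.5)
vectorised_result = update_weights_vectorised(weights, x, (2, 3), 2.0, 0.5)
```

The two functions produce the same result up to floating-point rounding; the vectorised one hands the per-node arithmetic to `numpy` instead of the Python interpreter.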
### Testing
As part of productionising this application, it is critical to rigorously test the implementation so that bugs and issues are identified before launching to production, and so that any bugs introduced by later changes are picked up after launch. Writing tests also often highlights code smells - encouraging better software development practices.
It is critical to test any implementation before it reaches production so that as many unexpected bugs and issues are identified as possible before deployment. Additionally, once the code is in production, any subsequent changes to the code should undergo testing to validate that the new behaviour is as expected. Finally, good tests often sniff out code smells because they require code to be structured in a modular way. Needless to say, testing is absolutely necessary.
I have used `pytest` to write unit and integration tests. In addition, I have used the [Hypothesis](https://hypothesis.readthedocs.io/en/latest/index.html) library to employ property-based testing. Let's consider the below example:
I have used `pytest` to write unit and integration tests in conjunction with [Hypothesis](https://hypothesis.readthedocs.io/en/latest/index.html) - facilitating property-based testing. Let's consider the below example:
```python
@given(
@ -104,21 +106,21 @@ def test_create_kohonen_params_invalid_failure(X, width, height, num_iterations,
assert isinstance(result, Failure) and isinstance(result.failure(), ValueError)
```
Rather than trying to think of different edge cases, `hypothesis` will simulate many different inputs that we expect to break the `create_kohonen_params` function. Importantly, `hypothesis` will try to break our test, i.e. find inputs that do not result in a `Failure` return type from `create_kohonen_params`, and it will then find the minimum failing example so the set of parameters that broke the test is as obvious as possible. If this test passes, I can be confident that `create_kohonen_params` is correctly returning a `ValueError` failure whenever it receives invalid inputs (e.g. negative width).
It is difficult to guess what the different edge-cases may be for a given function. `hypothesis` facilitates property-based testing, whereby each argument to a test function is assigned a strategy for its generation; `hypothesis` then uses these strategies to run the test function against a set of pseudo-randomly generated inputs many times over. It works as an adversary in that it will try to find inputs that break the test function - i.e. find breaking edge-cases. In the above case, `hypothesis` aims to find a case where `create_kohonen_params` does not return a `ValueError` failure after receiving invalid arguments (e.g. negative width).
The concept of a `Result` return type from a function is implemented by the [Returns](https://returns.readthedocs.io/en/latest/pages/result.html) library and is a functional monad: a `Result` will either hold a value if the function succeeded, or one of a defined set of exceptions if it failed. This enforces deliberate and explicit exception-handling/propagation. In the above example, we declare explicitly that `create_kohonen_params` should return a `ValueError` if it receives any invalid input.
I should acknowledge that functions I expose from a module often return a `Result` object, from the [Returns](https://returns.readthedocs.io/en/latest/pages/result.html) library. Rather than throwing and catching errors, a `Result` object serves as a container and allows the function author to declare that an invocation of the function will either be a `Success`, or one of an explicitly defined set of potential `Failure`s. In the above case, the only defined `Failure` for `create_kohonen_params` is a `ValueError` if the input arguments are invalid.
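The idea behind a `Result` container can be sketched with the standard library alone. The `Success` and `Failure` classes and the `validate_grid` function below are simplified stand-ins written for illustration; they are not the Returns library's API, where values are accessed through methods such as `.unwrap()` and `.failure()`, as in the snippets in this README.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class Success:
    """Stand-in container for a successful outcome (not the Returns class)."""
    value: Any


@dataclass(frozen=True)
class Failure:
    """Stand-in container for an expected, explicitly declared error."""
    error: Exception


def validate_grid(width: int, height: int) -> Union[Success, Failure]:
    """Hypothetical validator: return the error as a value instead of raising it."""
    if width <= 0 or height <= 0:
        return Failure(ValueError("width and height must be positive"))
    return Success((width, height))


valid = validate_grid(3, 4)
invalid = validate_grid(-1, 4)
```

The caller has to inspect which of the two containers came back, so the failure path cannot be forgotten in the way an uncaught exception can.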
### Benchmarking
I also used `hypothesis` to simulate different scenarios and thus evaluate the performance of both my implementation and Sam's. To assist in the benchmarking, I used MLFlow to allow me to visually compare the models and inspect different metrics. Another benefit of MLFlow is the ability to inspect metrics across iterations. For example, in the below image, we can see that both neighbourhood radius and learning rate reduce exponentially. This is a solid sanity check. Using a tool such as MLFlow makes comparing experiments and collaborating on models far easier. My inclusion of MLFlow in this codebase is pretty barebones and does not have any secret injection. I would encourage Sam to use MLFlow early into model development so he can measure and quantify how different model implementations and versions perform.
I also used `hypothesis` to simulate inputs for the `benchmark_kohonen_networks_performance_mlflow` benchmark to remove my own bias in creating test cases to compare the two models. To assist in the benchmarking, I used MLFlow which enabled me to compare the models in a GUI and easily inspect different metrics. Another benefit of MLFlow is the ability to inspect metrics across iterations. For example, in the below image, we can see that both `neighbourhood_radius` and `learning_rate` reduce exponentially. This is a solid sanity check because we expect this to be the case. Using a tool such as MLFlow makes comparing experiments and collaborating on models far easier. My inclusion of MLFlow in this codebase is pretty barebones and does not have any secret injection. I would encourage Sam to use MLFlow early on in model development so he can measure and quantify how different model implementations and versions perform.
![MLFlow](documents/mlflow.png)
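The sanity check described above can also be expressed numerically. The sketch below assumes a commonly used decay schedule, `value * exp(-t / time_constant)` with `time_constant = num_iterations / ln(initial_radius)`; the constants and function name are illustrative and are not read from this repository.

```python
import math


def exponential_decay(initial_value: float, iteration: int, time_constant: float) -> float:
    """Value of an exponentially decaying quantity at the given iteration."""
    return initial_value * math.exp(-iteration / time_constant)


# Illustrative constants (assumptions, not the defaults used in this codebase).
num_iterations = 100
initial_radius = 5.0
initial_learning_rate = 0.1
time_constant = num_iterations / math.log(initial_radius)

radii = [exponential_decay(initial_radius, t, time_constant) for t in range(num_iterations)]
learning_rates = [
    exponential_decay(initial_learning_rate, t, time_constant) for t in range(num_iterations)
]

# For an exponential curve, the ratio between consecutive values is constant.
radius_ratios = [radii[t + 1] / radii[t] for t in range(num_iterations - 1)]
```

A constant ratio between consecutive logged values is what distinguishes exponential decay from, say, linear decay, so this is the property to look for in the MLFlow charts.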
### Deployment
Since one of the primary benefits of the Kohonen Map is to perform dimensionality reduction, it is possible that all that need be deployed is the `models` package. One would likely want to name the package more descriptively, such as `kohonen_network`. It would be straightforward to use GitHub Actions to automatically update the package and submit it to a PyPI repository, allowing people to use `train_kohonen_network` and perform dimensionality reduction in their own pipelines.
Since one of the primary benefits of the Kohonen Map is to perform dimensionality reduction, it is possible that all that need be deployed is the `models` package. One would likely want to name the package more descriptively, such as `kohonen_network`. It would be straightforward to use GitHub Actions to automatically update the package and submit it to a PyPI repository, allowing people to use `train_kohonen_network` and perform dimensionality reduction in their own pipelines. In this way, the model is *baked* into the user's codebase and they can be responsible for its deployment.
Alternatively, the package could be imported into a Python Flask application that is hosted in a Docker container in a Kubernetes cluster or on a bare-metal server. This would make for an endpoint that any other services could use. Rather than a self-managed Flask app, one could use Databricks to serve the model, which has the added benefit of natively fitting into Databricks pipelines.
Alternatively, the package could be imported into a Python Flask application that is hosted in a Docker container in a Kubernetes cluster or on a bare-metal server. This would make for an endpoint that other services could use. Rather than a self-managed Flask app, one could also use Databricks to serve the model. This has the added benefit of natively fitting into Databricks pipelines. There are myriad ways one might wish to serve this model but the exact approach depends on the team, existing infrastructure, and problem at hand. It is also possible that the use-case demands storing the learned weights for comparison with newly observed data (i.e. anomaly detection). In such a circumstance it may be logical to have a model artifact repository.
A few example use-cases (beyond dimensionality reduction as a data preprocessing step) are:

@ -5,9 +5,9 @@ from models import train_kohonen_network, create_kohonen_params
X = np.random.random((1000,3))
IMAGE_SIZE = 10
IMAGE_SIZE = 100
params = create_kohonen_params(X, IMAGE_SIZE, IMAGE_SIZE, 1000).unwrap()
params = create_kohonen_params(X, IMAGE_SIZE, IMAGE_SIZE, 100).unwrap()
image_data = train_kohonen_network(X, params, use_mlflow=True).unwrap()
plt.imsave(f'{IMAGE_SIZE}.png', image_data)

BIN documents/1000.png (binary file added, 542 KiB)