Amazon Machine Learning for a multiclass classification dataset

We’ve taken a tour of Amazon Machine Learning over the last few posts.  Quickly recapping, Amazon supports three types of ML models with their machine learning as a service (MLaaS) engine – regression, binary classification, and multiclass classification.  Public cloud economics and automation make MLaaS an attractive option for prospective users looking to outsource their machine learning to public cloud API endpoints.

To demonstrate Amazon’s service we’ve taken the Kaggle red wine quality dataset and adjusted it to demonstrate each of the AWS MLaaS model types.  Regression worked fairly well with very little effort.  Binary classification worked well with a bit of effort.  Now to finish the tour of Amazon Machine Learning we will alter the wine dataset once more to turn our wine quality prediction engine into a multiclass classification problem.

Multiclass Classification

What is a machine learning multiclass classification problem?  It’s one that tries to predict if an instance or set of data belongs to one of three or more categories.  A dataset is first labeled with the known categories and used to train an ML model.  That model is then fed new unlabeled data and attempts to predict what categories the new data belongs to.

An example of multiclass classification is an image recognition system.  Say we wanted to digitally scan handwritten zip codes on envelopes and have our image recognition system predict the numbers.  Our multiclass classification model would first need to be trained to detect ten digits (0-9) using existing labeled data.  Then our model could use newly scanned handwritten zip codes (unlabeled) and predict (with an accuracy value) what digits were written on the envelope.

Let’s explore how Amazon Machine Learning performs with a multiclass classification dataset.

Kaggle Red Wine Quality Dataset

We ran the Kaggle red wine quality dataset untouched through the Amazon machine learning regression algorithm.  The algorithm interpreted the scores as floating point numbers rather than integer categories, which isn’t necessarily what we were after.  We then doctored up the ratings into binary categories of “great” and “not-so-great” and ran the dataset through the binary classification algorithm.  The binary categories were closer to what we wanted: treating the ratings as categories instead of decimal numbers.

This time we want to change things again.  We are going to add a third category to our rating system and change the quality score once more.  Rather than try to predict the wide swing of a 0-10 rating system, we will label the quality ratings as either good, bad, or average – three classes.  And Amazon ML will create a multiclass classification model to predict one of the three ratings on new data.

Data preparation – converting to three classes

We will go ahead and download the “winequality-red.csv” file again from Kaggle.  We need to replace the existing 0-10 quality scores with categories of bad, average, and good.  We’ll use numeric categories to represent quality – bad will be 1, average will be 2, and good will be 3.

Looking at the dataset histogram below, most wines in this dataset are average – rated a 5 or 6.  Good wines are rated > 6 and the bad wines are < 5.  So for the sake of this exercise, we’ll say wines rated 7 and higher are good (3), wines rated 5 or 6 are average (2), and those rated less than 5 are bad (1).  Even though the rating system is 0-10, we don’t have any wines rated less than 3 or greater than 8.

Wine quality Histogram

We could use spreadsheet formulas or write a simple Python loop to update our CSV “quality” column.  Replace the ratings from 3-4 with a 1 (bad), replace our 5-6 ratings with a 2 (average), and replace our 7-8 ratings with a 3 (good).  Then we’ll write out the changes to a new CSV file which we’ll use for our new ML model datasource.
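Here is a minimal sketch of that conversion in pandas, assuming the Kaggle file is still named winequality-red.csv (the output filename multiclass.wine.csv is just my choice):

import pandas as pd

# Load the original Kaggle file
wine = pd.read_csv('winequality-red.csv')

# Map the 0-10 scores into three classes: <5 -> 1 (bad), 5-6 -> 2 (average), >6 -> 3 (good)
wine['quality'] = pd.cut(wine['quality'], bins=[0, 4, 6, 10], labels=[1, 2, 3])

# Write the relabeled data out for the new ML model datasource
wine.to_csv('multiclass.wine.csv', index=False)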

Wine quality Histogram AFTER PROCESSING

Running the dataset through Amazon Machine Learning

Our wine dataset now has a multiclass style rating – 1 for bad wine, 2 for average wine, and 3 for good wine.  Upload this new CSV file to an AWS S3 bucket that we will use for machine learning.

The process to create a datasource, model, and evaluation of the model is the same one we documented in the earlier posts on linear regression and binary classification.  Create a new datasource and ML model from the main dashboard and point the datasource to the S3 bucket with our multiclass CSV file.  The wizard will verify the schema and confirm that our file is clean.

We need to identify our quality rating as a category rather than a number for Amazon to recognize this as a multiclass classification problem.  When setting up the schema, edit the ‘quality’ column and set the data type to ‘categorical’ rather than the default ‘numerical’.  We will again select “yes” to indicate that our CSV file contains column names.

SCHEMA WITH CATEGORICAL DATA TYPE

After finishing the schema, select the target prediction value which is the ‘quality’ column.  Continue through the rest of the wizard and accept all the default values.  Return to the machine learning dashboard and wait for the model to complete and get evaluated.

How did we do?

Let’s first look at our source dataset prior to the 70/30 split just to get an idea of our data distribution.  82% of the wines are rated a 2 (average), 14% are rated a 3 (good), and 4% are rated a 1 (bad).  By default, the ML wizard is going to randomly split this data 70/30 to use for training and performance evaluation.
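If you want to double-check that class balance yourself before uploading, a quick look with pandas does the trick (using the relabeled file from the earlier sketch):

import pandas as pd

# Share of each class in the relabeled dataset
wine = pd.read_csv('multiclass.wine.csv')
print(wine['quality'].value_counts(normalize=True).round(2))
# Roughly 0.82 for class 2 (average), 0.14 for class 3 (good), 0.04 for class 1 (bad)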

Datasource Histogram prior to Train/test split

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s take a look at our evaluation summary and see how we did.

EVALUATION SUMMARY

Looking at our evaluation summary, things look pretty good.  We have a summary saying our ML model’s quality score was measured with an average F1 score of 0.421.  That summary is somewhat vague so let’s click “explore model performance”.

Model Performance- Confusion matrix

Multiclass prediction accuracy can be measured as a weighted sum of the individual binary predictions.  How well did we predict good wines (3), average wines (2), and bad wines (1)?  Each category’s accuracy score is combined into what is called an average F1 score, where a higher F1 score is better than a lower score.

A visualization of this accuracy is displayed above in a confusion matrix.  The matrix makes it easy to see where the model is successful and where it is not so successful.  The darker the blue, the more accurate the correct prediction; the darker the orange/red, the more inaccurate the prediction.

Looking at our accuracy above, it seems we are good at predicting average wine (2) – our accuracy is almost 90% and we predicted 361/404 wines correctly as average (2).  However, the model was not so good at predicting good (3) and bad (1) wines.  We only correctly predicted 13/47 (28%) as good and only correctly predicted 2/26 (8%) as bad (1).  Our model is good at predicting the easy (2) category but not so good at predicting the more difficult and less common (1) and (3) categories.
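For anyone curious how these numbers come together, here is a rough scikit-learn sketch using made-up placeholder labels – in practice y_true would be the actual ratings from the 30% evaluation split and y_pred would come from a batch prediction against the model:

from sklearn.metrics import confusion_matrix, f1_score

# Placeholder labels – substitute the real evaluation labels and predictions
y_true = [2, 2, 3, 1, 2, 3, 2, 2, 1, 2]
y_pred = [2, 2, 2, 2, 2, 3, 2, 2, 2, 2]

# Rows are the actual class, columns the predicted class (ordered 1, 2, 3)
print(confusion_matrix(y_true, y_pred, labels=[1, 2, 3]))

# Average (macro) F1: the per-class F1 scores averaged with equal weight
print(f1_score(y_true, y_pred, labels=[1, 2, 3], average='macro'))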

Can we do better?

Our model is disappointing – we could have done almost as well if we just predicted that every wine was average (2).  We want to train a model that is better at predicting good (3) wines and bad (1) wines.  But how?  We could use a few ML techniques to get a better model:

Collect more data – More data would help train our model better, especially having more samples of good (3) and bad (1) wines.  In our case this isn’t possible since we are working with a static public dataset.

Remove some features – Our algorithm may have an easier time evaluating fewer features.  We have 11 features (pH, sulphates, alcohol, etc) and we really aren’t sure if all 11 features have a direct impact on wine quality.  Removing some features could make for a better model but would require some trial and error.  We’ll skip this for now.

Use a more powerful algorithm – Amazon uses multinomial logistic regression (multinomial logistic loss + SGD) for multiclass classification problems.  This may not be the best algorithm for our wine dataset; there may be a more powerful algorithm that works better.  However this isn’t an option when using Amazon Machine Learning – we’d have to look at a more powerful tool like Amazon SageMaker if we wanted to experiment with different algorithms.

Data wrangling –  Feeding raw data to the Amazon Machine Learning service untouched is easy but if we aren’t getting good results we will need to perform some pre-processing to get a better model.  Some features have wide ranges and some do not – for example the citric acid values range from 0-1 while the total sulfur dioxide values range from 6-289.  So some feature scaling might be a good idea.

Also, the default random 70/30 training and testing data split may not be the greatest way to train and test our model.  We may want to use a more powerful method to split the folds of the dataset ourselves rather than letting Amazon randomly split it.  Running a Stratified ShuffleSplit might be helpful prior to uploading the data to Amazon.
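As a sketch of what that could look like with scikit-learn (again assuming the relabeled multiclass.wine.csv; the output filenames are just my choice):

import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

wine = pd.read_csv('multiclass.wine.csv')

# One stratified 70/30 split that preserves the 1/2/3 class proportions in both folds
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
for train_idx, test_idx in splitter.split(wine, wine['quality']):
    train, test = wine.iloc[train_idx], wine.iloc[test_idx]

# Write the two folds out, upload both to S3, and use the "custom" option in the wizard
train.to_csv('wine.train.csv', index=False)
test.to_csv('wine.test.csv', index=False)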

Lastly, Amazon Machine Learning uses a technique called quantile binning for numeric values.  Instead of treating a range of numbers as discrete values, Amazon puts the range of values in “bins” and converts them into categories.  This may work well for non-linear features but may not work great for features with a direct linear correlation to our quality ratings.  Amazon recommends some experimentation with their data transformation recipes to tweak model performance.  The default recipes may not be the best for all problems.
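To get a feel for what quantile binning does (this is just the general idea, not Amazon’s exact recipe implementation), pandas can bucket a wide-ranging feature like total sulfur dioxide into a handful of equal-population bins:

import pandas as pd

wine = pd.read_csv('multiclass.wine.csv')

# Four equal-population bins labeled 0-3 replace the raw ~6-289 numeric range
wine['tso2_bin'] = pd.qcut(wine['total sulfur dioxide'], q=4, labels=False)
print(wine[['total sulfur dioxide', 'tso2_bin']].head())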

Final Thoughts

Machine learning is hard.  And while Amazon’s MLaaS is a powerful tool – it isn’t perfect and it doesn’t do everything.  Data needs to be clean going in and most likely needs some manipulation using various ML techniques if we want to get decent predictions from Amazon algorithms.

Just for fun I did some data wrangling on the red wine dataset to see if I could get some better prediction results.  I manually split the dataset myself using a stratified shuffle split and then ran it through the machine learning wizard using the “custom” option which allowed me to turn off the AWS 70/30 split.  The results?  Just doing a bit of work I improved the prediction accuracy for our good (3) wines to 65% correct, bad (1) wines to 25% correct, and raised the F1 score to .59 (up from .42, higher is better).

Confusion matrix with stratified shuffling

Thanks for reading!

Amazon Machine Learning for a binary classification dataset

Machine learning as a service is real.  Clean data with a well organized schema can be fed to cloud-based machine learning services with a decent ML model returned in less than 30 minutes.  The resulting model can be used for inferring target values of new datasets in the form of batch or real-time predictions.  All three public cloud vendors (AWS, Microsoft, Google) are competing in this space which makes the services cheap and easy to consume.

In our last discussion we ran the Kaggle red wine quality dataset through the Amazon Machine Learning service.  The data was fed to AWS without any manipulation, which AWS interpreted as a regression problem with a linear regression model returned.  Each of the subjective wine quality ratings was treated as an integer from 0 (worst) to 10 (best) with a resulting model that could predict wine quality scores.  Honestly, the results weren’t spectacular – we could have gotten similar results by just guessing the median value (6) every time and we almost would have scored just as well on our RMSE value.

Our goal was to demonstrate Amazon’s machine learning capabilities in solving a regression problem and was not to create the most accurate model.  Linear regression may not be the best way to approach our Kaggle red wine quality dataset.  A (somewhat) arbitrary judge’s score from 0-10 probably does not have a linear relationship with all of the wine’s chemical measurements.

What other options do we have to solve this problem using the Amazon Machine Learning service?

Binary Classification

What is a machine learning binary classification problem?  It’s one that tries to predict a yes/no or true/false answer – the outcome is binary.  A dataset is labeled with the yes/no or true/false values.  This dataset is used to create a model to predict yes/no or true/false values on new data that is unlabeled.

An example of binary classification is a medical test for a specific disease.  Data is collected from a large group of patients who are known to have the disease and known not to have the disease.  New patients can then be tested by collecting the same data points and feeding them to a model.  The model will predict (with error rates) whether it believes the new patients have the disease.

Let’s explore how Amazon Machine Learning performs with a simple binary classification dataset.

Kaggle Red Wine Quality Dataset

We ran the Kaggle red wine quality dataset through the Amazon machine learning regression algorithms in the last post.  Why this dataset?  Because it was clean data with a relatively simple objective – to predict the wine quality from its chemical measurements.  We also had no data wrangling to perform – we simply uploaded the CSV to AWS and had our model created with an RMSE evaluation score ready to review.

This time we want to change things a bit.  Rather than treat the 0-10 sensory quality ratings as integers, we want to turn the quality ratings into a binary rating.  Good or bad.  Great or not-so-great.  This greatly simplifies our problem – rather than have such a wide swing of a ten point wine rating, we can simply categorize the wine as great or not-so-great.  In order to do this we need to edit our dataset and change the quality ratings to a one (great) or a zero (not-so-great).

Data preparation – feature engineering

Go ahead and download the “winequality-red.csv” file from Kaggle.  Open up the .CSV file as a spreadsheet.  We need to replace the 0-10 quality scores with a 1 (great) or 0 (not-so-great).  Let’s assume most wines in this dataset are fairly average – rated a 5 or 6.  The truly great wines are rated > 6 and the bad wines are < 5.  So for the sake of this exercise, we’ll say wines rated 7 and up are great and wines rated 6 and under are not-so-great.

All we have to do is edit our CSV file with the new 0 or 1 categories, easy right?  Well, kind of.  The spreadsheet has ~1600 ratings and manually doing a search and replace is tedious and not easily repeatable.  Most machine learning datasets aren’t coming from simple and small CSV files but rather from big datasets hosted in SQL/NoSQL databases, object stores, or even distributed filesystems like HDFS.  Manual editing often won’t work and definitely won’t scale for larger problems.

Most data scientists will spend a decent amount of time manipulating and cleaning up datasets with tools that utilize some type of high level programming language.  Jupyter notebooks are a popular tool and can support your programming language of choice.  Jupyter notebooks are much more efficient for data wrangling using code instead of working manually with spreadsheets.  Amazon even hosts Jupyter notebooks within Amazon SageMaker.

Converting the wine ratings from 0-10 to a binary 0/1 is pretty easy in Python.  Just open the CSV file, test if each quality rating is a 7 or higher (> 6.5) and convert the true/false to an integer by multiplying by 1.  Then we’ll write out the changes to a new CSV file which we’ll use for our new datasource.

Python code
import pandas as pd
wine = pd.read_csv('winequality-red.csv')
# Ratings of 7 and up become 1 (great), everything else becomes 0 (not-so-great)
wine['quality'] = (wine['quality'] > 6.5)*1
# index=False keeps pandas from adding an extra index column to the new CSV
wine.to_csv('binary.wine.csv', index=False)

Running the binary classification dataset through Amazon Machine Learning

Our dataset now has a binary wine rating – 1 for great wine and 0 for not-so-great wine.  Upload this new CSV file to an AWS S3 bucket that we will use for machine learning.

The process to create a dataset, model, and evaluation of the model is the same for binary classification as we documented in the blog post about linear regression.  Create a new datasource and ML model from the main dashboard and point the datasource to the S3 bucket with our new binary CSV file.  The wizard will verify the schema and confirm that our file is clean.

What is different from linear regression is that when we look at the schema, we want to make sure the ‘quality’ column is a ‘binary’ type rather than a ‘numerical’ type.  All other values are numerical, only the quality rating is binary.  This should be the default behavior but it’s best to double check.  Also select “yes” to indicate that the first line in the CSV file contains the column names so the names are excluded from the model.

SCHEMA

After finishing the schema, select your target prediction value which is the ‘quality’ column.  Continue through the rest of the wizard and accept all the default values.  Return to the machine learning dashboard and wait for the model to complete and get evaluated.

How did we do?

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s check out the results by opening the evaluation.

evaluation

Looking at our evaluation summary, things look pretty good.  We have a summary saying our ML model’s quality score was measured with an AUC score of 0.778.  That summary is somewhat vague so let’s click “explore model performance”.

Model Performance

By default the wizard saves 30% of our dataset for evaluation so we can measure the accuracy of our model’s predictions.  Binary classification algorithms are measured with an AUC or Area Under the Curve score.  The measurement is a value between 0 and 1 with 1 being a perfect model that predicts 100% of the values correctly.

Our score shows our model got 83% of our predictions correct and 17% incorrect.  Not bad!  What is nice about this type of scoring is we can also see our false positives (not-so-great wine classified as great) and false negatives (great wine that was predicted as not-so-great).  Specifically, our model had 55 false positives and 25 false negatives.
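If you ever want to reproduce an AUC number outside of AWS, scikit-learn’s roc_auc_score does the same calculation – the values below are placeholders standing in for the 0/1 labels and predicted scores from the 30% evaluation split:

from sklearn.metrics import roc_auc_score

# Placeholder actual labels and predicted probabilities of class 1
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.1, 0.9, 0.5, 0.2]

# 0.5 is no better than guessing; 1.0 is a perfect ranking of great vs. not-so-great
print(roc_auc_score(y_true, y_score))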

Predicting quality wine is not exactly a matter of life and death.  Which means we aren’t necessarily concerned with false positives or false negatives as long as we have a decent prediction model.  But for other binary classification problems we may want to adjust our model to avoid false positives or false negatives.  This adjustment is made using the slider on the model performance screen shown above.

The adjustments come at a cost – if we want fewer false positives (bad wine predicted as great) then we’ll have more false negatives (great wine accidentally predicted as bad).  The reverse is also true, if we want fewer false negatives (great wine predicted as bad), we will have more false positives (bad wine predicted as great).
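A quick way to see that trade-off is to sweep the score cut-off and count false positives and false negatives at each setting – the placeholder values below just illustrate the mechanics behind the slider:

# Placeholder labels and scores; the cut-off plays the role of the AWS slider
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.1, 0.9, 0.5, 0.2]

for cutoff in (0.3, 0.5, 0.7):
    y_pred = [1 if s >= cutoff else 0 for s in y_score]
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    print(f"cutoff={cutoff}: false positives={fp}, false negatives={fn}")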

Final thoughts

AWS uses specific learning algorithms for solving the three types of machine learning problems supported.  For binary classification, Amazon ML uses logistic regression, which is a logistic loss function plus stochastic gradient descent (SGD).  Most beginners won’t mind these limitations but if we want to use other algorithms we’ll have to look at more advanced services like Amazon SageMaker.

Machine learning as a service is easy, fast, and cheap.  By using the AWS black box approach we built a working binary classification model in less than 30 minutes with minimal coding and only a high level knowledge of machine learning.

Thanks for reading!

 

Amazon Machine Learning for a regression dataset

Amazon Machine Learning is a public cloud service that offers developers access to ML algorithms as a service.  An overview can be found in the Amazon documentation and in my previous post.  Consider the Amazon Machine Learning service a black box that automatically solves a few types of common ML problems – binary classification, multi-class classification, and regression.  If a dataset fits into one of these categories, we can quickly get a machine learning project started with minimal effort.

But just how easy is it to use this service?  What is it good for and what are the limitations?  The best way to outline the strengths and weaknesses of Amazon’s offering is to run a few examples through the black box and see the results.  In this post we’ll take a look at how AWS holds up against a very clean and simple regression dataset.

Regression

First off, what is a machine learning regression problem?  Regression modeling takes a dataset and creates a predictive algorithm to approximate one of the numerical values in that dataset.  Basically it’s a statistical model used to estimate relationships between data points.

An example of regression modeling is a real estate price prediction algorithm.  Historical housing closing prices could be fed into an algorithm with as many attributes about each house as relevant (called features – e.g. square feet, number of bathrooms, etc).  The model could then be fed attributes of houses about to go up for sale and predict the selling price.

Let’s explore how Amazon Machine Learning models and performs with a simple regression dataset.

Kaggle Red Wine Quality Dataset

Kaggle is a community-based data science and machine learning site that hosts thousands of public datasets covering the spectrum of machine learning problems.  Our interest lies in taking Amazon Machine Learning for a test drive on a simple regression problem so I’m using a simple and clean regression dataset from Kaggle – the Red Wine Quality dataset.

Why this dataset?  Because we aren’t going to cover any data cleaning or transformations – we’re more interested in finding out how the AWS black box performs on a dataset with minimal effort.  Of course this is not a real world example but it keeps things simple.

The dataset contains a set of numerical attributes of 1600 red wine variants of the Portuguese “Vinho Verde” wine along with a numerical quality sensory rating between 0 (worst) and 10 (best).  Our goal will be to train an AWS model to predict the quality of this variety of red wine based off the chemically measurable attributes (features).  Can data science predict the quality of wine?

Amazon Machine Learning Walkthrough

First download the “winequality-red.csv” file from Kaggle.  Open up the .CSV file in a spreadsheet and take a look around.  We can see the names for each column, which include a “quality” column that rates the wine from 0-10.  Notice all fields are populated with numbers and there are no missing values.

We’ll need to upload our .CSV file into AWS storage to get our process started so log into the AWS console and open the S3 service.

S3 console

Create a new bucket with the default settings to dedicate to our machine learning datasets.

Create a new bucket

Upload the winequality-red.csv file into this bucket and keep the default settings, permissions will be adjusted later.

Upload the winequality-red.csv file

Now open AWS Machine Learning service in the AWS console.

Amazon Machine Learning Console

Click the “get started” button and then “view dashboard”.

Machine Learning Dashboard Launch

Click “create a new datasource and ML model”.

Create new datasource and ML model

AWS will prompt us for the input data for the datasource, we will use our S3 bucket name and the wine quality CSV file that was uploaded (it should auto-populate once we start typing).  Click “verify” to confirm the CSV file is intact.  At this point AWS will adjust the permissions for the ML service to have access to the S3 bucket and file (click “yes” to give it permissions).  Once we have a successful validation message, click “continue”.

Create datasource from CSV in S3 bucket

Successful validation

The wizard will now review the schema.  The schema is simply the fields in our CSV file.  Because our CSV contained the column names (fixed acidity, volatile acidity, etc) we want to select “yes” so we don’t try to model the quality based off the column names; selecting “yes” will remove the names from the model.

Schema

Now we want to select the target value that we want the model to predict. As stated earlier, we want to predict the quality of the wine based off the measurable attributes so select “quality” and continue.

Target

Keep the row identifier as default “no”.  We’ll skip this topic but it would be important if we were making predictions in a production environment.  Review the selections and “continue”.

Row Identifier

Review

Accept the default ML model settings which will split our dataset and set aside 30% of the data for evaluating the performance of the model once it is trained using 70% of the data.

Review one last time and then select “create ML model”.

Model settings

Review

We are done!  The dashboard will show the progress; the entire process takes about 10-20 minutes.  AWS will take over and read the data, perform a 70/30% split, create a model, train the model using 70% of the data, and then evaluate the model using the remaining 30% of the data.

How did we do?

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s check out the results by opening the evaluation.

Once again, this isn’t a real world example.  A data scientist would typically spend much more time looking at the data prior to evaluating an algorithm.  But since this is a black box approach, we will entertain the notion of skipping to the end without doing any data exploration.

EVALUATION SUMMARY

Looking at our evaluation summary, things look pretty good.  We have a green box saying our ML model’s quality score was measured with an RMSE of 0.729 and was better than the baseline value of 0.793.  That summary is somewhat vague so let’s click “explore model performance”.

ML Model performance

Since we saved 30% of our data for evaluation, we can measure the accuracy of our model’s predictions.  A common way to measure regression accuracy is the RMSE score or root-mean-square error.  The RMSE score measures the difference between the model’s prediction and the actual value in the evaluation set.  An RMSE of 0 is perfect; the smaller the value the better the model.

We can see a somewhat bell-shaped curve and our model is pretty close to zero which seems great.  However, if we look back at our summary we see that the RMSE baseline is .793 and our score is .729.  The baseline would be our score if we just guessed the median (middle) quality score for every prediction.  So although we are a bit better, we aren’t better by much and would be close if we just ignored all the attributes and guessed the median quality score (6) every time.
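For reference, here is roughly how the RMSE and that median baseline are calculated – the arrays below are placeholders standing in for the actual and predicted quality scores from the 30% evaluation split:

import numpy as np

# Placeholder actual quality scores and model predictions
y_true = np.array([5, 6, 5, 7, 6, 5, 6, 4, 6, 5])
y_pred = np.array([5.4, 5.8, 5.1, 6.2, 6.1, 5.3, 5.9, 5.0, 6.3, 5.2])

# RMSE of the model's predictions
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Baseline: predict the median quality for every wine and score that instead
baseline = np.full_like(y_pred, np.median(y_true))
baseline_rmse = np.sqrt(np.mean((y_true - baseline) ** 2))

print(rmse, baseline_rmse)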

Amazon Machine Learning – the good

Although our results were not incredibly impressive, the easy wizard-driven process and time to results were very impressive.  We can feed the AWS ML black box data and have a working model with an evaluation of its accuracy in less than 30 minutes.  And this can be done with little to no knowledge of data science, ML techniques, computer programming, or statistics.

And this entire process can be automated via the AWS API sets.  Think of an automated process that collects data and automatically feeds it to the AWS machine learning engine using code instead of manually clicking buttons in the console.  Predictions could be generated automatically on constantly changing data using code and the black box machine learning algorithms in the public cloud.
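As a rough sketch of what that automation could look like with boto3 (the IDs, bucket, and schema file below are hypothetical, and the exact parameters are worth verifying against the boto3 MachineLearning documentation):

import boto3

ml = boto3.client('machinelearning')

# Point a datasource at the CSV in S3 (the schema JSON mirrors what the console wizard builds)
ml.create_data_source_from_s3(
    DataSourceId='wine-quality-ds',
    DataSpec={
        'DataLocationS3': 's3://my-ml-bucket/winequality-red.csv',
        'DataSchemaLocationS3': 's3://my-ml-bucket/winequality-red.csv.schema',
    },
    ComputeStatistics=True,
)

# Train a regression model from that datasource
ml.create_ml_model(
    MLModelId='wine-quality-model',
    MLModelType='REGRESSION',
    TrainingDataSourceId='wine-quality-ds',
)

# Evaluate it (in practice you would point this at a separate 30% evaluation datasource)
ml.create_evaluation(
    EvaluationId='wine-quality-eval',
    MLModelId='wine-quality-model',
    EvaluationDataSourceId='wine-quality-ds',
)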

Lastly, the service is inexpensive.  It cost about $3 to run half a dozen datasets through the service over the past month while researching this blog.  All we need is some data in S3 and the service can start providing ML insights at a very low cost.

Amazon Machine Learning – the bad

A few points become clear after using the ML service.  The first is we have to adjust our expectations when feeding the AWS black box data.  Regression problems fed to Amazon Machine Learning are going to be solved using linear regression with stochastic gradient descent.  If a problem is easily solved using those algorithms then Amazon will do a great job making predictions.  If other algorithms like random forests or decision trees get better results then we need to look at more advanced Amazon services to solve those types of problems.

The second point is that data needs to be explored, cleaned, and transformed prior to feeding it to the machine learning service.  While we can look at correlations and data distributions after we create a datasource in AWS, there is no easy way to manipulate your dataset other than directly editing your CSV file.  There is an advanced function to use built-in data transformations as part of the ML model wizard, which is more of an advanced topic and is limited to the data transformations referenced in the Amazon documentation.  Accepting the defaults and not wrangling your input data may not get great results from the AWS ML black box.

Use case

Have a machine learning problem that is easily solved using linear regression with SGD?  Have an existing data pipeline that already cleans and transforms your data?  Amazon machine learning can outsource your machine learning very quickly and cheaply in this case without much time spent learning the service.  Just don’t expect a shortcut to your ML process, send Amazon clean data and you can quickly get an API prediction endpoint for linear regression problems.

 

Amazon Machine Learning – Commodity Machine Learning as a Service

Commodity turns into “as a service”

Find something in computing that is expensive and cumbersome to work with and Amazon will find a way to commoditize it.  AWS created commodity storage, networking, and compute services for end users with resources leftover from their online retail business.  The margins may have not been great, but Amazon could make this type of business profitable due to their scale.

But what happens now that Microsoft and Google can offer the same services at the same scale?  The need for differentiation drives innovative cloud products and platforms with the goal of attracting new customers and keeping those customers within a single ecosystem.  Basic services like storage, networking, and compute may not be differentiated between the cloud providers, but new software as a service (SaaS) offerings are more enticing when shopping for cloud services.

Software as a service (SaaS) probably isn’t the best way to describe these differentiated services offered by public cloud providers.  SaaS typically refers to subscription-based software running in the cloud like Salesforce.com or ServiceNow.  Recent cloud services are better described by their functionality with an “as a service” tacked on.  Database as a service.  Data warehousing as a service.  Analytics as a service.  Cloud providers build the physical infrastructure, write the software, and implement vertical scaling with a self-service provisioning portal for easy consumption.  Consumers simply implement the service within their application without worrying about the underlying details.

Legacy datacenter hardware and software vendors may not appreciate this approach due to lost revenue but the “as a service” model is a good thing for IT consumers.  Services that were previously unavailable to everyday users have been democratized and are available to anyone with an AWS account and a credit card.  Cost isn’t necessarily the primary benefit to consumers but rather the accessibility and the ability to consume using a public utility model.  All users can now have access to previously exotic technologies and can pay by the hour to use them.

Machine Learning as a Service

Machine learning is at the peak of the emerging technology hype cycle.  But is it really all hype?  Machine learning (ML) has been around for 20+ years.  ML techniques allow users to “teach” computers to solve problems without hard-coding hundreds or thousands of explicit rules.  This isn’t a new idea but the hype is definitely a recent phenomenon for a number of reasons.

So why all the machine learning hype?  The startup cost both in terms of hardware/software and accessibility is much lower, which presents an opportunity to implement machine learning that wasn’t available in the past.  Data is more abundant and data storage is cheap.  Computing resources are abundant and CPU/GPU cycles are (relatively) cheap.  Most importantly, the barriers have been lifted in terms of combining access to advanced computing resources and vast sets of data.  What used to require onsite specialized HPC clusters and expensive data storage arrays can now be performed in the cloud for a reasonable cost.  Services like Amazon Machine Learning are giving everyday users the ability to perform complicated machine learning computations on datasets that were only available to researchers at universities in the past.

How does Amazon provide machine learning as a service?  Think of the service as a black box.  The implementation details within the black box are unimportant.  Data is fed into the system, a predictive model/algorithm is created and trained, the system is tweaked for its effectiveness, and then the system is used to make predictions using the trained model.  Users can use the predictive system without really knowing what is happening inside the black box.

This machine learning black box isn’t magical.  It is limited to a few basic types of models (algorithms) – regression, binary classification, and multiclass classification.  More advanced operations require users to look to AWS SageMaker and require a higher skill level than the basic AWS machine learning black box.  However, these three basic machine learning models can get you started on some real-world problems very quickly without really knowing much math or programming.

Amazon Machine Learning  Workflow

So how does this process work at a high level?  If a dataset and use case can be identified as a regression or binary/multiclass classification problem, then the data can simply be fed to the AWS machine learning black box.  AWS will use the data to automatically select a model and train the model using your input data.  The effectiveness of the model is then evaluated and summarized with a numerical score.  This model is ready to use at this point but can also be tweaked to improve the scoring.  Bulk data can get fed to the trained model for batch predictions or ad-hoc predictions can be performed using the AWS console or programmatically through the AWS API.

Knowing that a problem can be solved by AWS takes a bit of high-level machine learning knowledge.  The end user needs to have an understanding of their data and of the three model types offered in the AWS black box.  Reading through the Amazon Machine Learning documentation is a good start in terms of an overview.  Regression models solve problems that need to predict numeric values.  Binary classification models predict binary outcomes (true/false, yes/no) and multiclass classification models can predict more than two outcomes (categories).

Why use Amazon machine learning?

For those starting out with machine learning this AWS service may sound overcomplicated or of questionable value.  Most tutorials show this type of work done on a laptop with free programming tools, why is AWS necessary?  The point is the novice user can do some basic machine learning in AWS without the high startup opportunity costs of learning programming or learning how to use machine learning software packages.  Simply feed the AWS machine learning engine data and get instant results.

Anyone can run a proof of concept data pipeline into AWS and perform some basic machine learning predictions.  Some light programming skills would be helpful but are not mandatory.  Having a dataset with a well-defined schema is a start as well as having a business goal of using that dataset to make predictions on similar sets of incoming data.  AWS can provide these predictions for pennies an hour and eliminate the startup costs that would normally delay or even halt these types of projects.

Amazon machine learning is a way to quickly get productive and get a project past the proof of concept phase in making predictions based on company data.  Access to the platform is open to all users so don’t rely on the AWS models for a competitive advantage or a product differentiator.  Instead use AWS machine learning as a tool to quickly get a machine learning project started without much investment.

Thanks for reading, some future blog posts here will include running some well-known machine learning problems through Amazon Machine Learning to highlight the process and results.

Are VMware customers really afraid of the public cloud?

I spent last week in Las Vegas attending VMworld 2018.  Over the years I’ve been to many IT conferences both as an individual attendee and while representing a vendor.  This time around, I had on my vendor hat doing booth duty.  Which means I stood in the expo hall at my employer’s booth with my fellow engineers and spoke to the passersby who wanted to see a demo of our product and talk about our technology.

My employer was very kind this year in giving booth staff 4 hour shifts instead of a full 8-hour day (thanks!).  I worked 3 days straight for 4 hours a day for a total of 12 hours (give or take).  Each hour I probably spoke to about 4-5 people at length giving a full demo of our product and talking about their current storage technology.  For the sake of argument, let’s say I spoke to about 50 people in total during my 12 hours working at the conference.

I had a fun time working at the conference and all the conversations I had at the booth with customers were great.  But after talking about their current technology and their future business goals for data storage, one thing struck me as honestly surprising.  These customers were not using and were generally not interested in public cloud storage.

Some context

I currently work in the field of data protection and data management.  I’ve been specializing in data storage for most of my career.  I’ve seen firsthand the explosion in the sheer amount of data stored in the enterprise datacenter over the past 15+ years.  We generate more data today than we ever had in the past.  And one of the greatest challenges to solve for is how to efficiently store this data without spending an inordinate amount of time and money on the problem.

Just because we have lots of data doesn’t mean it’s all important.  Most data is probably not that important.  But it’s often more difficult to classify data as important than it is to just keep it around.  It is also difficult for individuals and their managers to decide if things can be deleted – will the business need this data later?  Isn’t it easier (and safer) to just keep everything?  So what companies often do is just keep everything and try to find a cheap bulk platform for this data of questionable value.

Bulk data storage platforms have been around for many years and are most often located in the same datacenter as the networking, servers, and the storage platforms for more important data.  Tape is the most common storage medium for bulk data historically but spinning disk arrays are also popular for storing petabytes of archive data.  Software and data protection applications have the ability to move data around and usually have the ability to take older data that is less frequently accessed and stick it on bulk storage that is somewhere in the datacenter on a tape or slow spinning disk drive.  This was true 10 years ago and is still true today.

And I get it.  You get a solution that has worked for many years and it seems like a good enough solution.  But onsite bulk storage systems might gradually become expensive to maintain for the limited value they provide.  And their vendors may deliver very little innovation over the years and keep the architecture in business as usual mode.  But then again, the system is what everyone is used to, what everyone is comfortable with, and it mostly works.

Public cloud, no thanks!

Public cloud for bulk storage?  No thanks!  This was the overwhelming feeling I got when speaking with VMworld attendees. These were customers, not other vendors, who were working every day with data protection, data management, and data storage.  I would simply ask the folks I spoke with where they were keeping their bulk data and if their plans included moving this data to the big public cloud vendors – AWS, Microsoft, or Google.  And among the ~50 people I spoke with, the answer was a resounding “no”.

I’ll be the first to admit my sampling methods here are flawed and this is in no way representative of all customers at the conference.  Maybe only big iron customers wander into the vendor hall.  Or maybe only old-school sysadmins ask a vendor for a demo.  I doubt either of these were the case, but I will say this was an unscientific poll performed by me and I’m using some casual conversations I had to make a point.

Traditional enterprise customers that buy hardware and software from their favorite tried-and-true legacy vendors seem to have little interest in exploring the options and benefits offered by the major public cloud vendors.  Maybe this was reflective of their conservative employers that are risk and change averse.  Or maybe this was reflective of the personal views of these customers.  Either way, this seems odd to me and flies in the face of all current IT perceived trends and cost savings promised by the public cloud.

Operators and planners

I’ll break down the types of customers I spoke with into two categories.  This is admittedly oversimplified but will make my point easier.  End users of technology I spoke to at the conference were usually either technology operators or technology planners.

Technology operators are deep into the details of hardware and software solutions and run the day-to-day operations of IT shops.  They press the buttons, turn the knobs, and fix things when they break.  Operators become experts at the things they manage and know which vendors’ solutions work well and which might not work so well.  Operators are often seen as domain experts in their field and their advice is valued by non-experts.

Technology planners are often looking at the big picture in terms of architectures, project timelines, and cost.  Planners are less interested in the fine details and will go find an operator if that is necessary.  A planner could be an architect who is building a larger solution of several point products.  Or a planner might be a manager working directly with another set of planners on providing a solution that meets a set of business requirements.  Planners are trying to solve a business problem and keep the solution at a certain cost.

When I asked technology operators about public cloud storage, these people looked at me quizzically and said they didn’t currently use it.  Or they answered with an emphatic NO, as if AWS were going to take their job away and be a horrible end to an otherwise great career.

Technology planners had a bit more nuanced answer, they simply said that the business wouldn’t allow it due to some form of bureaucracy, lack of cost savings, or legal rules.  They had looked at it before and it was a dead end – the project was never implemented, there was a perceived data sovereignty problem, or the savings just couldn’t be justified.

This is easy!

Putting bulk backup and archive data in the public cloud is pretty easy.  Some might say it’s even a boring use case.  Because both the onsite data storage vendors and the public cloud vendors have already solved this problem.  Some may offer better integrations, some may be better solutions for certain use cases, but in general this problem has already been solved by the industry.

Is public cloud really cheaper?  For bulk data storage, it is most likely cheaper than your current onsite storage product but your mileage may vary.  Bulk storage prices are a race to the bottom in this market, if you are willing to pay by the month you can probably get a better deal and will continue to get a better deal as commodity storage prices drop.  You can even tier data within the public cloud for even cheaper prices.

Is my data secure in the public cloud?  Yes, if you encrypt everything at rest and in-flight and work with your vendors to make sure you get all the implementation details correct (hint – your storage operators can help here).  Now if you have laws against your data movement or other data sovereignty issues you may be out of luck.  But I wouldn’t get hung up on security, these are industry standard solutions that deploy agreed upon encryption algorithms.

This isn’t a commercial for my employer.  And I get it, cloud may not be the right fit for every use case or every business.  But your business may benefit greatly from storing bulk data on the cheapest possible platform, which is the public cloud.  If you aren’t using public cloud as a component in your current storage architecture, you are going to get left behind.  Why get bogged down with millions of dollars worth of complex storage gear onsite to store petabytes of data?  It is only a matter of time before your business competitors take advantage of cloud prices and efficiency.

Call to action

Technology operators – start to learn about public cloud storage.  You don’t need to be a cloud expert but take a look at object storage, learn about the tiers from AWS, Azure, and GCP, learn about the pricing, and compare those tools to the ones you have onsite in your datacenter.  Learn about ingress, egress, puts, and gets.  Learn how similar Glacier is to tape and how it differs from standard S3.  Most importantly, you will be more valuable to your business if you have more tools available to you and are able to use them when it makes sense.

Technology planners, work with enlightened technology operators who are able to objectively build solutions with you that may not include the point products you have used for 15+ years.  Revisit the possibility of storing data in different places and how you can save time and money.  Find a vendor or trusted partner to walk you through a TCO and evaluate something new that might take advantage of public cloud.

I hear business leaders who are served by the IT department say “we aren’t in the datacenter business” but I don’t see much action taken in seriously looking at the public cloud.  Bulk storage is the easiest thing to put in the cloud so take a fresh look at the solutions available!

Thanks for reading!

Backing up WordPress on Lightsail

I work for a data protection company.  I spend a decent amount of time (and a small amount of money) writing blogs posts on my WordPress site that runs on an AWS Lightsail instance.  So after publishing a few posts to my blog I naturally started thinking about how I was going to backup my site and protect it against any unforeseen glitches or hacks.

An AWS Lightsail instance runs on a single physical server.  That server could have hardware problems and/or reboot unexpectedly.  Which means your instance could experience an unexpected outage, similar to pulling the power plug when you were least expecting it.  In a worst case scenario it could lead to data corruption to your WordPress config files, your operating system, or your mySQL database.

Think about it from a security perspective as well.  WordPress sites get hacked all the time, so it would be nice to have the protection of a point in time copy of your instance plus your WordPress data that you could roll back to in case your site gets compromised.  A website is public facing, so it is very likely someone could scan your site for vulnerabilities and hack it.  For me it would be more a nuisance than anything else but I still don’t want to lose hours of work that I could have easily protected with backups.

WordPress – what to backup?

WordPress is pretty simple, we only have to protect a few things:

  • PHP – WordPress is written using PHP as the scripting language, we’ll need a good copy of the PHP config
  • Apache –  Apache is the web server that runs WordPress and serves up web pages, we’ll need a copy of the Apache config
  • mySQL – mySQL is the database used by WordPress for data management, this is an important component of our backup because it contains our website data
  • Lightsail instance / operating system – Our website won’t run without an operating system, we need a way to keep a point in time copy of our Linux OS

If you are running WordPress on a hosted site you probably don’t need to worry as much about backups.  Your service provider is probably running your WordPress backups and charging you for the service.  However, if you are running a Lightsail instance with a preconfigured WordPress application you do need to worry about protecting your site and OS.  Why?  Because AWS is simply giving you a Linux instance running in their cloud with a preconfigured Bitnami WordPress app installed.  AWS doesn’t take care of your backups, you have to do this yourself and this is part of the deal.

What about using a WordPress backup plugin?  You certainly could use a free or paid WordPress plugin to back up your WordPress site.  These usually take care of all the config files and the mySQL database that makes up WordPress.  Personally, I’m not terribly enthused with WordPress plugins – they expose your site to vulnerabilities, they constantly need patching, they get abandoned and deprecated over time by the developers, and they just seem clunky to me.

Also, what if my Linux operating system is corrupt or hacked?  With a WordPress backup plugin, I would have to redeploy my instance from scratch with a blank WordPress installation and then try to restore using the plugin.  I want a copy of my Linux instance so I have less work to do in the case of a disaster.  If my instance vanished tomorrow, I just want to bring everything back in a few minutes without reinstalling a bunch of packages and plugins, then manually restoring.

Getting a clean copy of your WordPress site in a backup

You might be tempted to just take a Lightsail snapshot of your running WordPress Linux instance and call it a day. Snapshots are a point in time clone of your Lightsail instance disk that can be used to spin up a copy of your instance in case of disaster or to launch a second identical instance.  You can create snapshots from the Lightsail console, the AWS API, or the AWS CLI installed on your instance.  I might sound paranoid for this, but an operating system snapshot isn’t adequate for my WordPress backup.  Just look at the Bitnami documentation for creating a full backup of your WordPress site.

PHP, Apache, and mySQL work together to run your WordPress site.  The operating system coordinates the compute, memory management, networking, and disk I/O subsystem.  If you try to just take a LightSail snapshot of the operating system, you may have a transaction in flight that is sitting in RAM but has not yet been committed to the mySQL database.  Or Linux may have filesystem I/O that has not yet been flushed to disk.  You can’t be sure you are getting a good backup unless you stop your WordPress services and back up the database and config files while the services are stopped.

Would a Lightsail snapshot work as a backup?  Sure, you can take an existing Lightsail snapshot and create a new instance from that snapshot.  And most likely your PHP, Apache, and mySQL would start up just fine and your website would work just like it did when you took the snapshot. But there is a slim chance that you could experience some type of problem where your PHP, Apache, or mySQL doesn’t start correctly.  The industry lingo for this concept is a crash consistent versus application consistent snapshot.  Lightsail snapshots are crash consistent since the operating system will boot up but not application consistent (the app may not start correctly).  We want application consistency for WordPress and crash consistency for our Linux image.

Application consistency with WordPress

Luckily you can easily get a backup of your WordPress site and use a Lightsail snapshot of the instance to keep everything in an AWS snapshot.  This involves a quick site outage but will guarantee your mySQL database and WordPress files are all quiesced with no pending writes.

The WordPress app bundled in (Linux) Lightsail is nicely contained in the directory /opt/bitnami.  Everything related to WordPress resides in that directory.  So all you have to do to backup WordPress is to stop all the services, create a backup tar image of the directory, and then start the services back up.

Once you have a tar image of /opt/bitnami you can then take a snapshot of your Lightsail instance.  This will create an AWS snapshot (stored in S3) of your instance that contains a tar file of your entire WordPress site.  If your instance got hacked or corrupted, all you’d have to do would be to create a new instance from the AWS Lightsail snapshot, stop the WordPress services, clear out the contents of the /opt/bitnami directory, untar the backup file to /opt/bitnami, then restart the WordPress services.

How to backup WordPress on Lightsail

We are going to do this with the CLI so first you’ll need to download your SSH key and SSH into your instance.  Unlike EC2 instances, your SSH user  is ‘bitnami’ instead of ‘ec2-user’.  Make sure you turn on SSH access on your instance firewall in the Lightsail console.

 

ssh -i YourLightsailKey-region.pem bitnami@<yourPublicIPaddress>

Install the AWS CLI tools so you can take snapshots of your instance from the command line.  That way you can script the backup process in ‘cron’ later if you want to automate this process.  Don’t install the AWS CLI with ‘apt-get’ since you’ll get an old version that doesn’t include the Lightsail tools.  Install the latest version with pip instead.  You’ll know if you have the right AWS CLI version if you see the option to run something like ‘aws lightsail help’.

$ aws --version

aws-cli/1.15.80 Python/2.7.12 Linux/4.4.0-1060-aws botocore/1.10.79

Now run ‘aws configure’ to connect the AWS CLI to your account.  Use your access key and secret key to connect and then pick the region where you have your Lightsail instance hosted.  You should now be able to list your instance names (save the name for later).

$ aws lightsail get-instances | grep name

            "username": "bitnami", 

            "name": "pebblesandweeds-512MB-myregion", 

                "name": "running"

Create a directory where you’ll store your WordPress backup tar file, I’m using /home/bitnami/backup.  Now stop all WordPress services (php, Apache, and mySQL).

sudo /opt/bitnami/ctlscript.sh stop

Now that everything is stopped, tar up everything in /opt/bitnami into a tar file in /home/bitnami/backup (or whatever backup directory you created).

$ pwd
/home/bitnami/backup
sudo tar -pczvf application-backup.tar.gz /opt/bitnami

Now start WordPress again to get your website back online.

sudo /opt/bitnami/ctlscript.sh start

Now create an instance snapshot for a crash consistent AWS snapshot of your running instance that contains a full WordPress site backup file embedded in the snapshot.  You’ll need your instance name as an argument (and the region if you are working with a region different than the one you gave in ‘aws configure’).

aws lightsail create-instance-snapshot --instance-snapshot-name my.latest.snapshot --instance-name pebblesandweeds-512MB-myregion

As an extra measure of safety, I’m going to also move my backup file to another S3 bucket in my account so I have a second copy.

$ aws s3 cp /home/bitnami/backup/application-backup.tar.gz s3://mybucketname

You probably want to store only the latest Lightsail instance snapshot to avoid getting charged for keeping many snapshots; you can easily remove old snapshots with the CLI or the Lightsail console.  The entire process can be scripted and scheduled as an automated process as well.
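Here is one way that cleanup might look in Python with boto3 (the instance name is the example from earlier, and the snapshot field names are worth double-checking against the Lightsail API docs):

import boto3

lightsail = boto3.client('lightsail')
instance_name = 'pebblesandweeds-512MB-myregion'  # from 'aws lightsail get-instances'

# Keep only the newest snapshot for this instance and delete the rest
snaps = [s for s in lightsail.get_instance_snapshots()['instanceSnapshots']
         if s['fromInstanceName'] == instance_name]
snaps.sort(key=lambda s: s['createdAt'], reverse=True)

for old in snaps[1:]:
    lightsail.delete_instance_snapshot(instanceSnapshotName=old['name'])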

Thanks for reading!

AWS Lightsail – what is it good for?

I recently started this blog and was looking at various platforms for hosting a WordPress site.  There are plenty of blog posts showing you how to get a WordPress blog up and running on AWS Lightsail.  There are even guides to point-and-click your way through getting a Lightsail instance running with a preconfigured WordPress application.  This is not one of those posts.

The more interesting question is should I use Lightsail for my blog?  What are the pros and cons?  Is Lightsail an appropriate hosting platform for my blog or other applications?  Can I scale it up or scale it out?  Why would I even use Lightsail, why not just spin up an EC2 instance?  What other alternatives compare to Lightsail outside of AWS?

My goal is to address these questions and give readers a feel for why they would use Lightsail.

**DISCLAIMER**  I do not work for AWS and therefore I am not a Lightsail expert.  I am an end user with a few AWS certifications.  My goal is to take what I have learned in AWS and share it with others, not to be the single source of truth on these topics.  I’m still working things out with this blog, so hit me up with any feedback you may have.

What is AWS Lightsail?

AWS services console

Think of Lightsail as a dumbed-down version of AWS for bloggers and developers who don’t want to deal with the complexity of EC2 compute, EBS storage, and VPC networking.  AWS components aren’t that difficult to use but there is definitely a learning curve to get started.  Lightsail removes that learning curve by simplifying the management and cost to get you productive quickly.  Simply put, Lightsail is cheap, quick, and easy.

The Lightsail “create instance” wizard lets you spin up a host in the AWS cloud in a matter of minutes without really knowing anything about AWS.  Pick an instance location (region and availability zone), select an operating system (Windows or Linux), select an optional open source application bundle or development stack (WordPress, Node.js, Nginx, etc.), choose an instance pricing plan, and deploy.  You can even bind your instance to a public-facing static IP address and register your DNS name.  Easy, right?

Lightsail create instance wizard
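If you prefer scripting over wizards, the same choices map onto the Lightsail CLI.  A rough sketch – the blueprint and bundle IDs below are examples, so list the real values first:

aws lightsail get-blueprints | grep blueprintId
aws lightsail get-bundles | grep bundleId
aws lightsail create-instances --instance-names my-wordpress-blog --availability-zone us-east-2a --blueprint-id wordpress --bundle-id nano_2_0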

Lightsail makes AWS cost budgeting and billing easy since you pay a fixed price per month if you stay within your plan boundaries.  Let’s face it – AWS billing is extremely complicated.  Just estimating the monthly price of a single small EC2 instance requires spreadsheets and calculators, and even then it is really just an estimate.  If you stay within your Lightsail plan boundaries you simply pay your fixed price per month and don’t worry about cost variations or unexpected charges.  You can overrun your plan and pay more than the fixed price, but if you are running a small blog or development project you most likely won’t see a nasty AWS bill at the end of the month when using Lightsail.

Is Lightsail really that easy to use?

Spinning up a small instance in AWS isn’t that difficult but you really do need a foundation in AWS to fully understand what is going on.  AWS has their own lingo and quirks when it comes to configuring compute, storage, networking, and security.  Even just to get around in the AWS console you need a class or some type of primer in how the AWS ecosystem works.

Lightsail on the other hand is the easy button for creating and running an instance.  What makes it so much easier?  The Lightsail management console.  This console is a completely separate management interface from the traditional AWS console with most of the AWS complexities hidden from the user.  Choices and options in the interface are kept to a bare minimum to avoid overwhelming the end user with options and lingo.  You can get around Lightsail without reading the manual.

We already talked about how easy it is to create a new instance with the “new instance” wizard.  Your networking options are also simple: you can add/remove static IP addresses, create/delete DNS zones, and create/delete load balancers.  And you don’t have to touch a VPC, subnets, or EC2 to get any of this done.

Storage options are also simple.  You automatically get an OS disk for your instance and the size will depend on the plan selected.  For more capacity, you can create an additional disk for your instance with a fixed price per month and attach it to your OS without knowing anything about the many EC2 and EBS options available to you.  And you can create snapshots (backups) of your instance with a few clicks in the Lightsail console.

Lightsail console – storage options
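The same storage operations are available from the Lightsail CLI if you’d rather not click through the console – a sketch, with the disk name, size, zone, and device path all being examples:

aws lightsail create-disk --disk-name extra-disk-1 --availability-zone us-east-2a --size-in-gb 32
aws lightsail attach-disk --disk-name extra-disk-1 --instance-name pebblesandweeds-512MB-myregion --disk-path /dev/xvdf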

Not only is it easy to create and manage instances, but it’s just as easy to add optional canned applications and stacks when you first create your image.  During deployment you can optionally choose to roll out WordPress or a LAMP stack as part of the initial creation of your OS.  This avoids having to SSH/RDP into your OS and do the manual installation yourself, which can save you some time.  You can only use this feature on the initial rollout of your instance; if you want to add packages to your OS later you will have to do so manually.  Rolling out preconfigured software is similar to deploying a prebuilt image, but it is limited to the subset of open source choices available for Lightsail.

So Lightsail is a quick and easy way to get an instance running in AWS with a prebuilt application and/or development stack, without much AWS training.  Lightsail takes away the headaches of rolling out an OS and application and gets you productive in a few minutes.

Lightsail architecture

Lightsail has the same architecture as traditional AWS with most of the details hidden from the user.  In fact, Lightsail (to me) seems like an adjacent but logically separate stack of AWS hardware and software with a simplified management interface.  Lightsail resources do not show up in your “normal” AWS console and the Lightsail console only shows Lightsail resources.

Lightsail is supported in a subset of the existing AWS regions (at this time) and seems to use all of the availability zones in those regions.  The available regions are shown in the graphic below; if I look at US-East-2 I can see that all three US-East-2 availability zones are options, the same as if I went to deploy an EC2 instance.

Lightsail regions and availability zones
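You can pull the same region and availability zone list from the CLI if you’re curious:

aws lightsail get-regions --include-availability-zones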

Public internet connectivity to your instance is done via a static public IP address attached to your instance.  Your static IP address seems to reside in some special type of Lightsail VPC, but it is hidden from you in the Lightsail console.  However, you can peer this Lightsail VPC to your “real” VPCs in the main AWS console – which is about the only interaction your Lightsail resources can have with your other AWS resources.

VPC peering to your Lightsail VPC from the AWS console
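The peering itself can also be driven from the Lightsail CLI; as far as I can tell these commands need no arguments beyond your configured region:

aws lightsail peer-vpc
aws lightsail is-vpc-peered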

Storage is the same in Lightsail as it is with EC2/EBS.  Root disks are created during instance creation and are EBS SSD volumes.  Optional additional EBS SSD disks can be attached to an instance after creation for more storage.  Snapshots are supported in Lightsail and can be used to deploy new instances in the same region or in other Lightsail-supported regions.

Lightsail instances map to T2 instances with fixed block storage starting points.  An entry-level Lightsail instance is a t2.nano, the next step up is a t2.micro, then a t2.small, a t2.medium, and finally a t2.large.  Each Lightsail T2 instance comes with a fixed amount of SSD block storage as the root disk, but you can add on additional disks afterwards without bumping up to the next T2 instance type.

My guess is that the Lightsail infrastructure (storage, servers, networking) resides in selected AWS regional datacenters (availability zones) and is kept both physically and logically separate from the main AWS hardware.  AWS must have built a separate stripped down management interface to manage Lightsail infrastructure but they seem to use the same technology underneath for compute, networking, and storage.  Lightsail seems like the same AWS technology that is limited to a certain set of AWS hardware and has an easier management interface and its own APIs.

Why Lightsail?

You may be coming to the conclusion that Lightsail is for AWS newbies who don’t want to learn how to spin up a proper EC2 instance.  Ease of use is definitely a factor when looking at Lightsail, and the automatic app stack deployment makes pushing out a prebuilt app image even easier.  Building an EC2 image while learning the AWS ecosystem can be frustrating if you have to go back and redo your work; Lightsail hides the details and lets you get working on your blog or code faster than trying to figure out how to build instances in AWS.

But the other compelling Lightsail value is the price.  At the time of this writing, you can get a Linux VM running a fully configured WordPress install in a few minutes for $5/month, which includes 512MB RAM, 1 vCPU, 20GB SSD, and 1TB transfer (egress).  This is the minimum bare-bones plan; the largest instance you can run in Lightsail is 8GB RAM, 2 vCPUs, 80GB SSD, and 5TB transfer (out) for $80/month.  Still a pretty good deal.  The first month is free if you choose the $5/month plan, which means you can spend a month testing the smallest instance to decide if you need something larger.

What if I pick a wimpy instance plan and want to scale up?  Scaling up simply means adding more resources to the same host – think of bolting on more CPU/RAM/disk to make the existing instance beefier.  You can always add more storage (disk) to an existing instance; you’ll just pay for the additional storage and don’t necessarily need to upgrade your plan.  But to increase CPU, RAM, storage, and egress, you simply take a snapshot of your instance and then use that snapshot to launch a new, larger instance on a beefier instance plan.  All of this is really easy in the Lightsail console and can be done in a few minutes in the wizard without much effort.  Of course you will pay more for the additional CPU/RAM/storage resources, but it is still fairly cheap and predictable compared to building your own EC2 instance.
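In CLI terms the scale-up path is roughly snapshot-then-relaunch – a sketch, with the snapshot name, new instance name, zone, and bundle ID all being examples:

aws lightsail create-instance-snapshot --instance-snapshot-name pre-upgrade --instance-name pebblesandweeds-512MB-myregion
aws lightsail create-instances-from-snapshot --instance-snapshot-name pre-upgrade --instance-names pebblesandweeds-1GB --availability-zone us-east-2a --bundle-id micro_2_0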

What is Lightsail good for?

AWS is going after individual bloggers and developers who don’t want to deal with the complexity of the full-blown AWS console and ecosystem.  Companies like DigitalOcean were doing well in this niche and Lightsail is a direct attack on DigitalOcean’s business.  AWS is now competing for small-scale business: individuals or (very) small businesses who want a tiny dev/web footprint in the cloud for hobby sites and dev/test.

All of which is to say that Lightsail is good for small business sites, blogs, and development projects.  If the specs for a single instance meet your requirements, it is much easier and cheaper to use Lightsail at a predictable price than it is to use EC2.  One common complaint about EC2 is the hidden charges and bewildering variable pricing; Lightsail bundles up a very easy instance deployment experience and charges a nice fixed price per month for a limited but very easy to use feature set.  So it is a great choice for small blog sites like this one.

What is Lightsail not good for?

I wouldn’t use Lightsail for production workloads that need enterprise or web-scale features.  You might love the ease of use and the price and decide to move some production workloads to Lightsail – don’t.  Why?  You cannot follow the guidelines of the AWS Well-Architected Framework.  Here are a few highlights so you don’t have to read the entire document.

A Lightsail instance is a single point of failure (SPOF)

No matter how you build it, a single Lightsail instance is a SPOF.  There is nothing magical about AWS: your instance lives on a physical server in an AWS datacenter (AZ) and your instance will go down if that server has problems.  If your instance OS has a problem or your application crashes, you will have downtime.  All the same things that happen to servers, operating systems, and applications in your datacenter will happen in AWS.  So it’s probably not a great idea to run (in production) a single monolithic app on a single Lightsail instance if you expect any resiliency.

Yes, you can add a Lightsail load balancer and increase your resiliency within an AWS region, but you will most likely be over-provisioning at that point.  Do you want to run (and pay for) two Lightsail instances 24×7 to support your application?  Public cloud and AWS give you an incredible amount of flexibility in designing a resilient application that goes above and beyond just two instances behind a load balancer.  It doesn’t make much sense to do this in Lightsail unless you are doing a proof of concept or development work and can afford to pay double for Lightsail.

Lightsail is not web-scale

Deploying apps in the cloud is all about agility and scale.  You don’t get these with Lightsail; you need to architect for scale using EC2.  Lightsail only scales up – you can bolt more CPU/RAM/disk onto your instance, but only until you hit the point where you can’t add any more resources to the largest Lightsail instance.  Lightsail doesn’t spin down or turn off when it is underutilized, and it doesn’t spin up additional instances when certain metrics are hit (CPU, network traffic, etc.).  So Lightsail is not intended to be web-scale, won’t scale out or scale down automatically, and isn’t ideal for building a production cloud application.

One thing I haven’t explored is how to integrate Lightsail with the rest of the AWS ecosystem (RDS, DynamoDB, Lambda, etc) but I’d imagine it isn’t well integrated even if it is possible.

You can’t easily convert Lightsail to EC2

What happens if you love your Lightsail instance so much you want to deploy to production in EC2?  Well, this isn’t really a supported use case, and as far as I can tell at the time of this writing it isn’t really possible.  Amazon keeps Lightsail intentionally separated from other AWS services and I haven’t seen an easy way to move Lightsail snapshots over to EC2.  I’m sure there are some creative workarounds that I would categorize as hacks; maybe AWS will implement this feature in the future.  So don’t plan on using Lightsail as your test/dev environment, since Lightsail snapshots currently aren’t portable over to EC2.

Conclusion

I’m running this blog on a Lightsail instance with WordPress.  I wanted a fixed price per month without worrying about my bill fluctuating.  I wanted to get a blog up quickly without spending too much time getting all my AWS components configured correctly and then getting WordPress installed correctly.  And the ease of use of the Lightsail console was very attractive since I was already familiar with the complexity of the AWS ecosystem.  Lastly, I don’t care much about resiliency since this is just a hobby blog – I don’t expect lots of traffic and I’m not losing money if my site goes down temporarily.

Lightsail has met all of my expectations so far – stay tuned for more posts!


Hello world

“Hello world” blog posts are lame.  They are not interesting and I avoid reading them.  But I need to post this one.

Why?  Well, I’m in the IT industry and I’ve blogged on my previous employer’s communities site.  I have recently left that company and joined a new one.  The point-and-click corporate blogging experience I used previously made it easy to push out content, but I wasn’t crazy about the rigid formatting or the need to keep writing content related to products sold by my company.  Now that I have left, I want the freedom to build my own site and blog about whatever I want.

So what is the best way for an IT infrastructure guy to start blogging?  I’ve recently gotten interested (and certified) in AWS.  Mind you, I’m not an AWS expert but I’ve gotten through the learning curve and can get around AWS fairly easily.  So I figured I’d find a way to spin up a blog in AWS to keep my skills sharp while working on the site.

I started looking around AWS and found Lightsail.  Lightsail has its pros/cons which I’ll continue to explore in future posts but it takes some of the drudgery away from getting an instance running in the AWS cloud.  That ease of use comes at the expense of scalability, resiliency, and performance.   But it is quick & easy to run and will scale up (a bit) if you wind up needing more resources.

I’m specifically using WordPress on a Lightsail instance.  WordPress for me is a work in progress – I haven’t worked with websites since the late 90s.  So I’m using a canned WordPress template and working out the kinks as I go.  Some things work, some things don’t.  And the formatting looks pretty goofy.

All of which is to say that this “hello world” post is a way for me to work through the setup with a short intro.  I’ll do more posts about various things soon and get everything working.

So there it is, a short post to help get this blog off the ground.  It isn’t particularly pretty or creative, but thanks for reading anyway.