Etsy.com launched in 2005, and since then we have hosted the site and related services in self-managed data centers. Some time ago we decided to evaluate migrating to the cloud. Running our own hardware in data centers made sense at the time, but IaaS and PaaS offerings have changed dramatically since then, and we needed to revisit that decision. We chose Google Cloud Platform (GCP) as our cloud provider, and we are very happy with the choice. The move takes Etsy to a best-in-class service provider and frees up more of our time for strategic features and services that improve the Etsy marketplace.
Although we refer to the cloud provider as a vendor here, we do not see this as simply choosing a vendor. We are building a long-term partnership. The provider we selected will play a decisive role in the success of our initial migration and will remain essential to the scalability and availability of our site and services. That is why we took our time with this decision and analyzed it carefully. This article describes how we chose our partner. It does not cover why we decided to migrate or the business goals we set to measure the project’s success.
From One, Many
Migrating to the cloud is one huge project, but it is made up of many smaller ones. To evaluate each cloud provider properly, we had to identify those subprojects, define their requirements, and then use the requirements to assess the different providers. In addition, to scope the project as a whole, we needed to understand each subproject in detail: its dependencies, its timing, and its priority.
The first step was to identify eight major projects, including the production render path for the site, the site’s search service, production support systems such as logging, and Tier 1 business systems such as Jira. We then divided these projects further into component projects, such as MySQL and Memcached. In the end we had identified more than thirty subprojects. We brought together experts from across the company to define each subproject’s requirements. Gathering these requirements was no small task. For example, we needed to know the latency tolerance of our MySQL databases and what our data warehouse required of an API for creating and deleting data. A RACI model helped us consolidate the requirements and establish ownership of each subproject.
RACI
RACI is a model for identifying, for each subproject, the people who are Responsible, Accountable, Consulted, and Informed. We use the following definitions:
- Responsible: the person whose primary responsibility is to complete the project or initiative.
- Accountable: the person who approves the work before it is considered complete.
- Consulted: someone who has information needed to complete the project.
- Informed: someone who must be kept up to date on progress but whose approval is not required.
Each Responsible person owned the requirements gathering and the dependency mapping for their subproject, while the Accountable person made sure the Responsible person had the time, resources, and information needed to complete the project. The Accountable person also signed off on the project’s completion.
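As an illustration only, a RACI assignment can be recorded as a small data structure, which makes it easy to spot subprojects that do not have an owner yet. This is a minimal sketch, not the tooling we used: the class name `RaciAssignment`, the helper `unowned`, and the people named in the example are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RaciAssignment:
    """One subproject and the people holding each RACI role (hypothetical model)."""
    subproject: str
    responsible: str          # completes the project or initiative
    accountable: str          # approves the work and signs off on completion
    consulted: list[str] = field(default_factory=list)  # hold needed information
    informed: list[str] = field(default_factory=list)   # kept up to date only


def unowned(subprojects, assignments):
    """Return the subprojects that have no RACI assignment yet, in input order."""
    owned = {a.subproject for a in assignments}
    return [name for name in subprojects if name not in owned]


# Example with invented names: MySQL has owners, Memcached does not yet.
assignments = [
    RaciAssignment("mysql", responsible="alice", accountable="bob",
                   consulted=["carol"], informed=["dave"]),
]
missing = unowned(["mysql", "memcached"], assignments)
```

Here `missing` lists the subprojects that still need a Responsible and an Accountable person before requirements gathering can start.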
Architectural Review
Etsy has long used an architectural review process in which any significant change to our environment, whether in technology or design, undergoes peer review. This process should not be underestimated, since it demands a significant amount of time from senior engineers. Pooling their knowledge, experts from across the company produced architectural review documents running to more than thirty pages.
We also realized that to evaluate different cloud providers we needed a clear understanding of the various parts of our system. For example, to provision in our colocation centers we use a custom toolset that builds bare-metal servers and VMs. We also rely on Chef roles and recipes, which are applied to the provisioned bare-metal servers and virtual machines. We identified several key goals for selecting the tools we would use to build infrastructure in the cloud: high flexibility, accountability, security, and centralized access control. After trying several tools, our provisioning team proposed a new workflow through an architectural review. The conclusion was to use Terraform together with Packer to build the base OS images.
In total, we held more than 25 architectural reviews for the major components of our system and environments. We also found that some parts needed a deeper review, so we held eight additional workshops. For example, when reviewing the backend systems involved in generating etsy.com pages, we paid particular attention to latency constraints and failure modes. These reviews produced a set of requirements that we could use to evaluate the different cloud providers.
How It Fits Together
With a set of requirements for the major components of our system in hand, we started mapping out the migration. First, we needed to understand how the components related to one another. Teams of engineers gathered around whiteboards, drawing dependency graphs and working through how systems and subsystems interact.
The exercise was a success: we identified the supporting pieces behind the major components, including scheduling and monitoring tools, caching pools, and streaming services. This gave us high-level estimates of project timing and effort, which we laid out in Gantt-style project plans.
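A whiteboard dependency graph can be turned into a rough ordering with a topological sort. The sketch below is one way to do that, assuming components are migrated only after everything they depend on. The edges in `DEPENDS_ON` are invented for illustration and do not describe our actual architecture, and `migration_waves` is a hypothetical helper name. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the components in its set.
DEPENDS_ON = {
    "render_path": {"mysql", "memcached", "search"},
    "search": {"monitoring"},
    "mysql": {"monitoring"},
    "memcached": {"monitoring"},
    "monitoring": set(),
}


def migration_waves(depends_on):
    """Group components into waves; each wave depends only on earlier waves.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # validates the graph and detects cycles
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable, readable order
        waves.append(ready)
        sorter.done(*ready)                 # unblock components that depended on these
    return waves


waves = migration_waves(DEPENDS_ON)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Components in the same wave have no dependencies on each other, so in a Gantt-style plan they are candidates for running in parallel.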
Experimenting
Not long ago we gained hands-on experience in the cloud by running some Hadoop jobs there, which gave us a clear view of the problems we might face during a migration, especially around scaling. However, that experience did not give us the full picture of the provider we ultimately chose, because it did not involve GCP.
That’s why we later decided to experiment by running batch jobs on GCP using Dataproc and Dataflow. The experiment bore fruit: we learned that some services were still in their initial releases and could not yet sustain the workloads and SLAs we required. This was the first of many times we had to answer the question “build our own tool or use a cloud service?” In this case we decided to run Airflow on GCP VMs. To make decisions like this well, we considered several criteria, including vendor support for the service, vendor independence, and the impact of the decision on our teams.
There is no single right answer to this question, since it depends on the team. Nor is this our final word: we will revisit the question as more information or other projects become available.
Meetings
The process lasted five months, and during that time we had many opportunities to meet with the Google team. We covered a range of topics, from getting to know the team to narrowly focused subjects such as the use of containers in the cloud. These meetings let us discuss key use cases and also strengthened the engineering relationship between Etsy and Google.
We also met with customers who told us about their own migrations and the challenges they faced along the way. These meetings were well worth the time, and they showed us the open culture that is widespread among NY technology companies.
We also met with key Etsy stakeholders to keep them up to date on our progress and to involve them in certain decisions, for instance how to balance mitigating financial risk against the time-discounted value of cost commitments. As a result, the people making the final decision were already familiar with the work and would not be caught off guard at the final stage.
The Decision
By this point, stakeholders, vendors, and engineering teams had given us a huge number of data points. We used a decision matrix, a tool for evaluating multiple-criteria decision problems, to organize and prioritize this data and to give an unbiased evaluation of each vendor’s offering and proposal. Our decision matrix contained more than 200 factors, prioritized by 1,400 weights, and produced more than 400 scores.
We began by determining the overall functional requirements. We identified relationship, cost, ease of use, value-added services, security, locations, and connectivity as the seven top-level functional requirement areas. We then listed every customer requirement (more than 200 in all) and weighted each one by how well it supported the overall functional requirements.
To express the level of support, we used a 0, 1, 3, 9 scale. For instance, autoscaling support (one of the customer requirements) received a 9 for cost (because it would help cut costs by quickly scaling our compute clusters), a 9 for ease of use (because it would save us from manually spinning VMs up and down), a 3 for value-added services (it goes beyond basic compute and storage, but it is not much of a differentiator), and a 0 for every other functional requirement area. Several engineering teams evaluated and weighted these factors in depth. The nonlinear weighting scale made clear priorities emerge, pushing even cautious evaluators to commit to a decision.
Later on, we rated each vendor’s offering against the requirements. To express how well each cloud vendor met a requirement, we again used a 0, 1, 3, 9 scale. Returning to the autoscaling example, every cloud vendor received a 9 because the requirement was fully met: autoscaling was supported as native functionality. Each vendor scored more than 50,000 points overall, with GCP scoring 10% more than the others.
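The arithmetic behind such a matrix can be sketched in a few lines. This is an illustration under stated assumptions, not our actual spreadsheet: it assumes a requirement’s weight is the plain sum of its support scores across the seven functional areas and that a vendor’s total is the sum of weight times score over all requirements. The second requirement, the vendor names, and their scores are invented. Only the autoscaling numbers come from the example above, with `ease_of_use` standing for the ease of use (operability) score.

```python
# Illustrative sketch only: the summing rule, names, and most data are assumptions.
SCALE = {0, 1, 3, 9}  # the nonlinear scale described in the article

AREAS = (
    "relationship", "cost", "ease_of_use", "value_added_services",
    "security", "locations", "connectivity",
)


def requirement_weight(support):
    """Weight of one requirement: sum of its 0/1/3/9 support scores per area.

    Areas missing from `support` count as 0.
    """
    for area, score in support.items():
        if area not in AREAS:
            raise ValueError(f"unknown functional area: {area}")
        if score not in SCALE:
            raise ValueError(f"score must be 0, 1, 3, or 9, got {score}")
    return sum(support.values())


def vendor_total(requirements, vendor_scores):
    """Weighted total for one vendor: sum of weight * score over all requirements."""
    total = 0
    for name, support in requirements.items():
        score = vendor_scores.get(name, 0)  # unscored requirements count as unmet
        if score not in SCALE:
            raise ValueError(f"score must be 0, 1, 3, or 9, got {score}")
        total += requirement_weight(support) * score
    return total


requirements = {
    # From the article: 9 for cost, 9 for ease of use, 3 for value-added services.
    "autoscaling": {"cost": 9, "ease_of_use": 9, "value_added_services": 3},
    # Hypothetical second requirement, invented for this example.
    "example_requirement": {"security": 3, "connectivity": 1},
}

# Hypothetical vendors; both fully meet autoscaling, as in the article.
vendors = {
    "vendor_a": {"autoscaling": 9, "example_requirement": 3},
    "vendor_b": {"autoscaling": 9, "example_requirement": 1},
}

totals = {name: vendor_total(requirements, scores) for name, scores in vendors.items()}
```

Under these assumptions autoscaling has a weight of 9 + 9 + 3 = 21, so a vendor that fully meets it earns 21 × 9 = 189 points from that single requirement. Because the scale jumps from 3 to 9, a requirement that strongly supports even one functional area outweighs several that support areas only weakly.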
Bear in mind, however, that Etsy’s decision matrix applies only to Etsy. We are not suggesting that our decision is the right one for you or your organization. Rather, we want to give you insight into our approach, which has been very important to us.
Only the Beginning
This is only the beginning. The selection process, which took about five months and involved a large number of people from across the company, is just the first step in our migration to GCP. We have about two years to complete the migration, and during that time we will continue to deliver innovative product features while minimizing risk during the transition. Finally, we look forward to the opportunities that migrating to GCP will give us to focus more on services for the Etsy marketplace by partnering with best-in-class service providers.