We go in-depth on how GPT works, which is pretty exciting if you’re curious about exactly how ChatGPT sounds so much like an all-knowing real person. It’s not magic; you’ll learn how the Transformer architecture works, and how multi-headed self attention unlocked the key to training systems like GPT in parallel on massive amounts of training data.
You’ll also practice using the OpenAI API, allowing you to use GPT, ChatGPT, and DALL-E’s capabilities within your own applications.
These new lessons are chock-full of professionally designed illustrations, and new hands-on activities using Google CoLab, HuggingFace, and lots of our own code. There are a couple of really fun activities – we’ll fine-tune GPT using real IMDb movie reviews, and create a system capable of talking in depth about movies. And even more fun, we’ll create our own version of Data from Star Trek by fine tuning OpenAI’s Davinci model with all of the scripts from Star Trek: The Next Generation. Just imagine: we’re creating a real AI that mimics a fictional one from the 80’s!
These are exciting times in the world of AI, and understanding how the latest AI systems work will really make you stand out with your employers. Enroll now if you haven’t already.
This article is an excerpt from our “Mastering the System Design Interview” course. For this section we will dive into a brief review of the various cloud computing technologies out there, and how they connect to the system design interview.
One thing that’s changed in system design interviews is that it’s not always necessary to design things from scratch. We don’t always have to assume that you’re going to be designing your own layout of servers in your own data center. Oftentimes, you can just use an existing technology within one of those cloud service providers like Amazon Web Services or Google Cloud, or Microsoft Azure. And sometimes, that might be a perfectly appropriate thing to invoke, and it can save you some time and trouble. So, let’s get started!
Again, these are just tools in your toolbox that you can draw on during a given system design problem. I’m not going into a lot of depth here; I could spend hundreds of hours talking about each one of these services if I wanted to. The objective here is to know these services exist and you can call upon them as needed as part of your design.
I’ve made three columns in the chart above, one for Amazon Web Services, one for Google Cloud, and one for Microsoft Azure. They all have their own offerings for these basic general classes of services.
Let’s start with storage. You have to put your raw data somewhere, right? If you’re being asked to process a massive amount of data, that must start in some location. These storage services can store pretty much anything (technically, they are “object stores”). Unlike a database, they are not limited to structured data.
AWS’s storage solution is S3, the Simple Storage Service. S3 is just a place where you can store objects across the cloud within AWS. You pay based on usage and the prices are cheap. If you need to store a massive dataset, you can throw it in S3 and then use additional AWS services to process that data and impart structure to it.
Google Cloud offers Cloud Storage, and Azure has different flavors of storage services. You can ask it for Disk, Blob, or Data Lake Storage, depending on what you’re trying to do. There’s that “data lake” term again. That is the concept of storing a massive amount of unstructured data somewhere, imparting structure to that data, and querying it as if it were structured. A data lake needs a massive storage solution like S3, Google Cloud Storage, or Microsoft Azure Data Lake Storage to store that data in the first place.
Let’s also talk about compute resources. If you need to provision individual servers and you want to have complete control over what those servers are doing, they all have solutions for that as well. Amazon offers EC2 which allows you to rent virtual machines as small or large as you want. That can even include different flavors of boxes that might focus more on GPUs than CPUs or might focus more on memory or storage speed. Whatever it is you need to optimize for, they have a specific server type you can choose from. If you’re doing deep learning, you might want to choose one of their big GPU instances to throw the most muscle you can at a big deep learning problem (they won’t be cheap, though).
Similarly, Google has Compute Engine, which is the same idea. And Microsoft Azure just calls their offering virtual machines. Every cloud provider has a solution for renting virtual machines on an as-needed basis and being charged by the hour for how much you’re using them.
If you need a big NoSQL distributed database, we can do that too. DynamoDB is the go-to solution for that on AWS. Google Cloud still calls it Bigtable, and they have some more specific services for more refined use cases. Azure has something called Cosmos DB or Table Storage. All three providers offer a distributed NoSQL data store that will allow massive scaling of key/value lookups.
Containers are also a big deal. If you want to deploy code to the outside world, putting that within a container is a modern operational practice. These days, Kubernetes is winning the battle over how containers get orchestrated on the cloud services. All three services offer some sort of managed Kubernetes service. On AWS, that is EKS, the Elastic Kubernetes Service (ECS is Amazon’s own container orchestrator, and ECR is its container registry). Google Cloud offers Google Kubernetes Engine (GKE), and Azure has Azure Kubernetes Service (AKS).
They each offer solutions for data streaming as well. You can always just run Kafka or Spark Streaming on a compute instance or on Amazon’s Elastic MapReduce (EMR) service. But there are also managed, purpose-built services for streaming. AWS has something called Kinesis that’s used for data streaming, which integrates tightly with other AWS services. That’s just used for getting data from one place to another, and maybe transforming it and analyzing it along the way. Google Cloud calls the same thing Dataflow, and Microsoft Azure offers Stream Analytics.
We can also briefly discuss Spark and Hadoop. How would I deploy them in the public cloud? On AWS, they have something called EMR, which stands for Elastic MapReduce. The name is a bit of an anachronism because you can use it for much more than MapReduce these days. Specifically, you can also deploy Apache Spark on it, as well as other streaming technologies and analytics technologies. But the nice thing about EMR is that it manages the cluster for you. You just say, “Hey, I want a Spark cluster with this many nodes. Go create it for me”. And EMR says, “Yup, here you go. Here’s your master node. Go run your driver script here. And it’s all set up and ready for you.” EMR saves you a ton of hassle in provisioning and configuring those servers. You just get a Spark cluster that’s ready to go.
Similarly, Google Cloud has something called Dataproc, and on Azure, they have an implementation of Databricks. Databricks is a very influential company in the world of Apache Spark and a big contributor to Spark itself. If you’re a fan of Databricks, Microsoft Azure might be your platform of choice.
For larger-scale data warehousing, they all offer solutions for that as well. On AWS, we have something called Redshift. Again, you just tell it, “I want to provision a data warehouse that has this much storage capacity,” and it says, “Okay, here you go, go to town.” It also has a variant called Redshift Spectrum, which can sit on top of an S3 data lake and issue queries on unstructured data as well. Google Cloud still offers BigQuery, its original technology for distributing SQL queries, or queries in general, across a massive dataset. And on Azure, we have Azure Synapse Analytics (formerly Azure SQL Data Warehouse), alongside Azure SQL Database for more conventional relational workloads.
Finally, let’s talk about caching. On AWS, we have something called ElastiCache, which is a managed wrapper on top of Redis or Memcached. And on Google Cloud, they call it Memorystore, which can be Redis or Memcached under the hood. Azure offers a Redis solution as well. It seems like Redis is winning the battle against Memcached in the public cloud. All three platforms let you deploy a Redis server fleet and will manage it for you.
No matter the system, the goal is always the same. If you’d like to learn more about any of these cloud computing platforms before your system design interview, enroll in our courses at www.sundog-education.com.
The Secret to Nailing Your System Design Interview
By Frank Kane
This article is an excerpt from our course, Mastering the System Design Interview. We hope you find these tips useful in your next interview.
So what is the secret to nailing your system design interview? Well, it might be simpler than you think. One of the top ways to nail your System Design Interview is to simply THINK OUT LOUD!
Yes, you read that right: think out loud during your interview. Don’t just clam up for 10 minutes while you think about things. For all you know, you only have 15 minutes for this question, and you don’t want to spend 10 minutes of it just sitting there in silence while your interviewer wonders what you’re thinking. Do yourself a favor, and do NOT do that.
Think out loud. Use that time to clarify requirements, define the constraints of what you need to build, and then think out loud about the high-level solutions you’re considering to meet those requirements. Say what you’re thinking. “Okay, well, I think we need a CDN, and maybe we need to use Apache Spark to process the data down here, and maybe Kinesis to stream stuff over here, and maybe DynamoDB over here to store that data.” Just talk and think, and let the interviewer see how you’re approaching this problem from a technical standpoint.
Thinking out loud is important because it gives the interviewer not only a chance to see how you think but also to steer you in the right direction and see how you respond to that guidance. Part of what they’re trying to evaluate is what it will be like to work with you every single day. So, work with them. Show that you can work with them. Think out loud. Let them work collaboratively with you. Show that you can take feedback. Show that you can modify your design in response to that feedback, and not get defensive about it. That’s a much better use of that 10 minutes than just sitting there in silence.
Again, you don’t know how much time you have for a given system design problem. Very often, it’s only about 20 minutes. If you spend half of it just sitting there in silence, you’ve wasted that time. You’ve wasted that opportunity to show your interviewer how you think.
So here are the key takeaways:
Don’t just clam up for 10 minutes while you think about things.
Clarify requirements, and define the constraints of what you need to build.
Think out loud about the solutions you’re considering to meet those requirements.
Give the interviewer a chance to steer you in a different direction before you start diving into details.
You don’t know how much time you have for this part of the interview, so make every minute count.
If you’re interested in learning more strategies, master your system design interview today with our Mastering the System Design Interview course.
One of the best ways to prepare for an interview is with mock questions. Having thought through potential interview questions will help you to better articulate and position your answers.
In this article, I will walk through mock interview questions that are often asked in the Systems Design Interview to help you prepare your answers before the interview.
This set of mock interviews features questions I’ve asked as an interviewer, questions I’ve seen other interviewers ask, and questions I’ve been asked while interviewing at big tech companies myself. For each one, we’ll show you the right questions to ask before you dive into a solution and give you a chance to sketch out your own system design.
Then, we’ll present a transcript of how a real interview might go and show you what a good interview for this question looks like. Finally, we’ll debrief after each mock interview and talk about what made that interview successful, and what you should learn from it.
These are also opportunities to gain experience in how the various technologies we’ve discussed earlier in the course fit together. Practice makes perfect, and that applies to interviewing as well!
Mock Interview: Example #1
Question: Design a URL shortening service.
CANDIDATE: OK, so we’re talking about something like bit.ly, right? A service where anyone can enter a URL, get a shorter URL to use in its place, and we manage to redirect them?
INTERVIEWER: Yup, at a very high level, that’s the idea.
CANDIDATE: What sort of scale are we talking about?
INTERVIEWER: A lot. Say millions of redirects every day. And we don’t want to make any design decisions that might limit us later, so assume millions of URLs as well.
CANDIDATE: Any restrictions on the characters we use? Symbols might be a little too hard for people to remember or type…
INTERVIEWER: It’s good that you’re thinking about usability and the customer experience. Yeah, symbols would be a pain, as would be remembering the capitalization of characters and stuff. But, would that limit you too much? Does that give you enough characters to work with?
CANDIDATE: Well, how short is short?
INTERVIEWER: The shorter, the better. How many characters do you figure you’d need?
CANDIDATE: Well, if we use nothing but lowercase letters and numbers to make them easy to remember… that’s 36 characters, right? So we basically have a base-36 system here. Personally, all I can remember would be 6 characters, so how many URLs could that represent? Whatever 36 to the 6th power is… mind if I use the calculator on my phone for that?
INTERVIEWER: Sure, I can’t do that in my head either.
CANDIDATE: Let’s see… oh wow, that’s over 2 billion. So yeah, 6 characters should be plenty for the foreseeable future.
INTERVIEWER: Sure, sounds good. Any more questions?
CANDIDATE: How about vanity URLs? Can people specify their own URL if it’s available?
INTERVIEWER: Yeah, that would be nice to have. Might be something only registered users or paid users get.
CANDIDATE: Do we let them edit and delete short URLs once created?
INTERVIEWER: If they have an account, sure. We don’t want people editing or deleting other people’s URLs.
CANDIDATE: How long do shortened URLs last?
INTERVIEWER: Well, forever. We don’t want a bunch of dead links out there 5 years from now. Good thing you’ve got room for 2 billion URLs!
CANDIDATE: Let’s start by thinking about the APIs to this system.
We’ve asked some clarifying questions here, and you have enough to get started. So, before we go into the actual mock interview and see how that goes down, try it yourself. Get a piece of paper, and sketch out some designs.
Here are some questions to ask:
How would you implement the system?
What APIs do you think will be needed?
How will you work backward from those APIs to develop a system that can work at this massive scale, and handle both the storage of those mappings and the redirects?
Go give it a shot.
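As a starting point for your own sketch, here is a quick check of the base-36 arithmetic from the conversation above. It also shows one possible way to turn a numeric ID into a 6-character code. The idea of handing out sequential integer IDs and encoding them is my assumption for illustration; the interview itself doesn't settle on how codes are generated, and the function names are made up.

```python
import string

# Alphabet from the conversation: digits plus lowercase letters (36 symbols).
ALPHABET = string.digits + string.ascii_lowercase
CODE_LENGTH = 6


def capacity(length: int = CODE_LENGTH) -> int:
    """Number of distinct codes of the given length."""
    return len(ALPHABET) ** length


def encode(number: int, length: int = CODE_LENGTH) -> str:
    """Turn a non-negative integer ID into a fixed-length base-36 code.

    Assumes (hypothetically) that each new URL gets the next integer ID.
    """
    if not 0 <= number < capacity(length):
        raise ValueError("ID does not fit in the requested code length")
    chars = []
    for _ in range(length):
        number, remainder = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode(code: str) -> int:
    """Inverse of encode(); Python's int() already understands base 36."""
    return int(code, 36)


print(capacity())      # 2176782336, a little over 2 billion
print(encode(12345))   # 0009ix
```

Sequential IDs make codes easy to guess, so a real design might shuffle or randomize them; that trade-off is worth raising out loud in the interview.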
Mock Interview: Example #2
Question: Design a Restaurant Reservation System
CANDIDATE: Ok, you want me to design a restaurant reservation system. Is this just for one restaurant, or for any number of restaurants like OpenTable or something?
INTERVIEWER: It’s like OpenTable, so it can cover many restaurants.
CANDIDATE: All right, let’s think about the user experience first. A user will want to select a restaurant, enter their party size, find a list of available times near the time they want, lock in their reservation, and get some sort of confirmation via SMS or something. They’ll also need some way to change or cancel reservations.
INTERVIEWER: Yes, that’s good. There are some nuances we could talk about, but you’ve got the main operations we need to support there.
CANDIDATE: So there are probably thousands of restaurants out there that might be a part of this system, and tens or hundreds of thousands of diners. They’ll expect this system to be fast and reliable. Am I right in thinking we should optimize for performance and reliability over cost?
INTERVIEWER: Yes, I want you to design a system that is both scalable and reliable, and with fast load times. Assume some investor gave us millions of dollars, and money isn’t really a problem.
CANDIDATE: I suppose the restaurant is also a customer…what would they need? Reporting, analytics, a way to set up how many tables and their configurations, how many tables to hold aside for walk-ins, a way to contact reservation holders…
INTERVIEWER: Yes, good thinking there. In the interest of time though, let’s just concern ourselves with the diners, and what we need to build in order for them to successfully schedule a reservation at their favorite restaurant.
Again, it’s time to try it yourself before I walk you through the mock interview.
Here are some questions to ask yourself:
How would you organize the data that’s needed for this system?
How would you structure that data?
How would you store it?
How would you distribute that storage, and how do you design a system, more generally, that would scale to thousands of restaurants and hundreds of thousands of users?
Take a stab at that yourself on a piece of paper somewhere, or your own whiteboard or virtual whiteboard. And when you’re ready, come back, and we’ll see how our interviewee here actually handled the problem.
CANDIDATE: Let me sketch some thoughts on the data we’ll need while I’m thinking of it…So we’ll need a customer table, and a restaurant table for sure. We’ll need to tie them together so each customer and restaurant will need some unique ID associated with them. What might we need to know about a customer…certainly their name, contact info, and maybe some information to help them find their favorite restaurants or restaurants close to them. So we’ll need their location as well, and maybe a list of their preferences, like their favorite restaurants. We’ll also need to store their login credentials, but this would probably be stored in a more secure system or using some single sign-on system, and not here.
For the restaurant, we also need its name, address, and contact info. We also need to know its layout so we can match up reservation requests to available tables. The application we build will have to have some fairly complex logic for assigning reservations to tables; maybe even taking into account the possibility of moving tables together to accommodate large groups. We also need to make some assumptions about how long it takes for a dining party to finish their meal and clear the table for the next reservation, so that’s something the restaurant will probably want to be able to control – the length of time a reservation lasts. Maybe that ends up being a function of the party size as well or the time of day; we’d have to interview real restaurant owners to understand how to best model that. I assume they’ll also want to keep some tables aside to handle walk-in customers, so we should at least let the restaurant specify how much capacity they want to hold back for walk-ins.
INTERVIEWER: That’s great; you’re really thinking of the customers here and what they will need.
CANDIDATE: So, finally, we’ll need a reservation table that ties it all together. The app will have to use its own logic to assign reservation requests for a given customer, restaurant, and time. So somewhere, we will have a table of reservations, partitioned by restaurant ID so we can quickly look up reservations for a given restaurant. I imagine we’d further partition by date to make it quick to look up existing reservations for a given date at a given restaurant, which the algorithm will need to try and find an opening.
INTERVIEWER: Great that you’re thinking about how the data is stored for optimal performance. So, is there a reason you’re going with a normalized data representation instead of a denormalized one?
CANDIDATE: Well, thinking about the operations we’ll likely need to do…let’s see…you’ll probably already have the customer ID and restaurant ID on the client by the time you navigate to the point where you want to create a new reservation, right? I think it’s simpler to just retrieve information on restaurants and clients as needed via their own hits to the database, or the cache in front of the database. That way we don’t waste space, and we don’t have to deal with the problem of updating everything in some huge denormalized table whenever a customer changes their phone number or something. If, while testing, we find that there is some complex join operation that we’re doing over and over again and it is a performance bottleneck, we could revisit that, but my instinct here is to start simple and only add the complexity of denormalization when needed.
INTERVIEWER: Makes sense to me. Keep going.
CANDIDATE: What information is associated with a reservation…obviously the customer and restaurant it is for, the party size, and the time. We might also want a space for notes to the restaurant, like any special occasions or dietary restrictions they might want to know ahead of time.
INTERVIEWER: OK, that’s all good. Let’s move on to designing the larger system here.
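To make the data model from this part of the conversation concrete, here is a minimal sketch of the three tables as Python dataclasses, with reservations grouped by restaurant and date as the candidate described. Every field name, the 90-minute default, and the in-memory dictionary standing in for a partitioned table are assumptions for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Dict, List, Tuple


@dataclass
class Customer:
    customer_id: str
    name: str
    phone: str
    location: str
    favorite_restaurant_ids: List[str] = field(default_factory=list)


@dataclass
class Restaurant:
    restaurant_id: str
    name: str
    address: str
    phone: str
    table_sizes: List[int]          # seats per table; a stand-in for the real layout
    reservation_minutes: int = 90   # hypothetical default the restaurant can change
    walk_in_tables: int = 0         # capacity held back for walk-ins


@dataclass
class Reservation:
    restaurant_id: str
    customer_id: str
    start: datetime
    party_size: int
    notes: str = ""                 # special occasions, dietary restrictions

    def partition_key(self) -> Tuple[str, date]:
        """Partition by restaurant first, then by date."""
        return (self.restaurant_id, self.start.date())


# In-memory stand-in for the partitioned reservations table.
_reservations: Dict[Tuple[str, date], List[Reservation]] = defaultdict(list)


def add_reservation(reservation: Reservation) -> None:
    _reservations[reservation.partition_key()].append(reservation)


def reservations_for(restaurant_id: str, day: date) -> List[Reservation]:
    """Everything booked at one restaurant on one day: a single-partition read."""
    return list(_reservations.get((restaurant_id, day), []))
```

The point of the partition key is that the table-assignment logic only ever needs one restaurant's reservations for one day, so that lookup never has to scan anything else.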
CANDIDATE: So I think the design is pretty straightforward. We have a bunch of clients that represent our diners, running an app or something that needs to issue service requests over HTTP somehow over the internet.
Since we can have a large number of diners, we will need to horizontally scale the servers that process these requests. The act of placing a reservation or retrieving information about a diner or a restaurant seems atomic and stateless, so that shouldn’t really pose a problem. We just have APIs for requesting a reservation and retrieving metadata to display about users and restaurants. There also needs to be some API for securely logging in, creating an account, and stuff like that… but let’s assume we’re using some secure, external system for user management which is outside of what we’re building. Ideally, these servers would be hosted across different racks, data centers, and regions, and geo-routed whenever possible. That would maximize availability, assuming we build in sufficient capacity to handle an outage of an entire region.
And I’m going to draw a hand-wavy, big “NoSQL” database here that stores our customers, restaurants, and reservations tables. The application logic for assigning reservations to time slots will live in the servers that talk to this database. Although I’m drawing it as a single, giant bottleneck, this is really some sort of horizontally scaled database system to ensure it can handle high loads and high availability.
We’ll probably also want to send text messages to people reminding them of their reservations, so we’ll have some application server off on the side querying the same database and firing off SMS messages as appropriate. I’m drawing this as a single server as that probably would be sufficient, but of course, we’d have some sort of failover set up on that as well, maybe with just a cold standby ready to go. This seems like sort of a nice-to-have feature, but if it is deemed critical we could also put it behind a load balancer just to ensure we have redundancy all the time.
INTERVIEWER: I mean, is there really any reason not to do that?
CANDIDATE: No, I suppose not. So, let’s imagine another load balancer and at least a couple of servers in different data centers handling the SMS part.
INTERVIEWER: Tell me more about your big hand-wavy NoSQL database. How would you go about choosing a specific technology for that?
CANDIDATE: Well, part of it would come down to what tools your staff is already familiar with. If you’re an AWS shop, then I would think DynamoDB would fit the bill nicely. But, let’s think about the CAP theorem. You said earlier we care about availability and speed, and a system distributed at this scale has to tolerate network partitions. So that means we can maybe give up a little on consistency. So, something like Cassandra that has eventual consistency in exchange for not having a single master server might be a reasonable choice. But I think I would push back on those requirements; consistency is probably important for this application; it just isn’t something we’ve talked about yet. We definitely don’t want two customers ending up with the same reservation slot. I mean, in practice, even the databases that trade off availability are still highly available if you throw enough secondary servers and backup master servers at them. So the usual suspects like MongoDB or DynamoDB, or their equivalents in Google Cloud or Azure, are probably fine choices.
INTERVIEWER: Yeah, that’s good. Business owners don’t always think about these things, and part of your job is to help them think about these sorts of requirements and the tradeoffs involved. Now, the data you sketched out earlier is relational in nature – we’ve got customers and restaurants referenced in each reservation. Do we need a traditional relational database like Oracle or MySQL to handle that?
CANDIDATE: No, the application servers can query the individual tables and join them internally as needed. We’re not doing anything complicated where that would be a real performance concern. Modern distributed databases can just do the join for us efficiently on their own anyhow. Let’s go with “NoSQL” meaning “Not Only SQL”.
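The consistency concern raised a moment ago (two customers ending up with the same slot) usually comes down to making the write conditional: claim the slot only if nobody holds it yet. Below is a minimal single-process sketch of that idea. The `SlotStore` class and its key layout are hypothetical; in a real distributed database you would lean on whatever conditional-write or transaction feature it provides rather than a local lock.

```python
import threading
from typing import Dict, Optional, Tuple

# (restaurant_id, table_id, ISO start time): a hypothetical key layout.
SlotKey = Tuple[str, str, str]


class SlotStore:
    """In-memory stand-in for a table that supports 'put if absent'."""

    def __init__(self) -> None:
        self._slots: Dict[SlotKey, str] = {}
        self._lock = threading.Lock()

    def claim(self, key: SlotKey, customer_id: str) -> bool:
        """Atomically claim a slot; returns False if it is already taken."""
        with self._lock:
            if key in self._slots:
                return False
            self._slots[key] = customer_id
            return True

    def holder(self, key: SlotKey) -> Optional[str]:
        with self._lock:
            return self._slots.get(key)


store = SlotStore()
slot = ("restaurant-1", "table-7", "2024-05-01T19:00")
print(store.claim(slot, "alice"))  # True
print(store.claim(slot, "bob"))    # False, alice already holds the slot
```

The check and the write happen under one lock here, which is exactly the guarantee an eventually consistent store does not give you for free.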
INTERVIEWER: OK, we just have a few minutes left before I have to move on. One last question: What about caching? Do we need it? How can we further improve the performance of this system?
CANDIDATE: Hm, well, we don’t really have a lot of static content in this system, so something like a CDN probably wouldn’t do a whole lot of good. If the client applications are just web pages, though, we’d probably want a CDN for fast hosting of the CSS, JavaScript, and images needed on the client side. We talked about hosting the app servers across different regions and geo-routing to them, so at least that will cut down on some latency. We probably would want to have some sort of cache for the database queries, though. The customer and restaurant data isn’t likely to change often, so that can certainly be cached. Let’s assume we have something like Memcached or Redis sitting on top of those queries inside the app servers. Maybe Memcached because it’s simpler and we don’t need anything fancy here. That gives us a little more flexibility in how the database is distributed across regions as well. It doesn’t do much good to geo-route to servers if those servers all have to talk to one region for their data.
INTERVIEWER: Cool. Obviously, there’s a lot more to talk about if we were to build this for real, but you hit on all of the main concerns. Let’s move on.
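The caching idea from that last answer follows a common pattern: check the cache first, fall back to the database on a miss, and remember the result for a while. Here is a minimal in-process sketch. The `CacheAside` class, the five-minute default TTL, and the `load_restaurant` loader are all made up for illustration; a real deployment would talk to Memcached or Redis instead of a Python dictionary.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class CacheAside:
    """Tiny stand-in for a cache sitting in front of database queries."""

    def __init__(self, loader: Callable[[Hashable], Any],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader   # hypothetical function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                     # cache hit, still fresh
        value = self._loader(key)               # miss or expired: hit the database
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Call this when the underlying row changes (e.g. a new phone number)."""
        self._entries.pop(key, None)


db_reads = []


def load_restaurant(restaurant_id):
    db_reads.append(restaurant_id)              # record each simulated database read
    return {"restaurant_id": restaurant_id, "name": "Example Bistro"}


cache = CacheAside(load_restaurant, ttl_seconds=60)
cache.get("r1")
cache.get("r1")
print(len(db_reads))  # 1, the second lookup was served from the cache
```

Customer and restaurant records suit this well because they rarely change. Reservation availability is a different story, since a stale answer there can lead straight to a double booking.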
Mock Interview: Example #3
Question: Design a Web Crawler
CANDIDATE: We’re designing a web crawler. Like, the entire web – or just a few sites?
INTERVIEWER: Yup, the entire web.
CANDIDATE: I thought you might say that. So we’re talking, like, billions of web pages. Crawled how often?
INTERVIEWER: Let’s say the whole thing should be updated every week.
CANDIDATE: And, we need to check pages we’ve crawled before to see if they have been updated, right?
INTERVIEWER: That’s right.
CANDIDATE: OK, do we need to store a copy of every page as we go? Does that include images?
INTERVIEWER: Yes, we need to store the HTML at least. For now, I don’t care about images, but it would be nice if your design could be extended to handle them later.
CANDIDATE: What about dynamic content? Stuff that’s rendered client-side?
INTERVIEWER: That’s a good thing to ask about. Again let’s set that problem aside for now, but if your design can be extended for it and we have time to talk about it, we can go there later.
CANDIDATE: What’s the main purpose of this crawler? I should’ve asked that first, really.
INTERVIEWER: We’re building a search engine. That’s why I’m mainly concerned with just storing text for now.
Now that we’ve answered some clarifying questions and defined our requirements, it’s time for you to try it yourself once again.
How would you distribute this crawler to handle the massive scale required? We’re talking about the entire internet here. That’s crazy.
What algorithms will you use to crawl the entire web? We need to bring back what we learned about algorithms and data structures.
What problems and failure modes can you anticipate and address in your design?
Give it a shot on your own, and when you come back, we’ll go through a mock interview showing one approach to the problem.
CANDIDATE: OK, let me start by thinking about it from an algorithmic standpoint. Basically, web pages are vertices on a directed graph, right? And the links between them are the edges of the graph. So fundamentally, this is a graph traversal problem.
INTERVIEWER: Right. So, what kind of traversal would you do here?
CANDIDATE: Well, the choices are breadth-first search or depth-first search. Let me think about that for a second. The number of links on one page is pretty finite; that would represent breadth. But the depth of the Internet is pretty much infinite. I think that makes BFS the only real tractable solution here.
INTERVIEWER: Remind me how BFS works.
CANDIDATE: So, starting at some page, you’d go through every link on the page, and hand off the processing of each link to some other process, in the name of scalability, I’d think. Then each link on the child nodes is processed, working your way across this graph from left to right. As opposed to DFS, where we would follow one path all the way to the end, then back up and follow another path all the way to the end. The problem is that following any path to the end will take pretty much forever. BFS is usually the way to go, and this seems like no exception.
INTERVIEWER: OK, good. Let’s get to the hard part and make this scale to billions of web pages.
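Before scaling it up, here is what that BFS traversal looks like on a toy, in-memory link graph. This is a single-process sketch only: the `links` dictionary stands in for actually downloading pages, and `max_pages` is a made-up safety limit.

```python
from collections import deque
from typing import Dict, List


def bfs_crawl_order(links: Dict[str, List[str]], seed: str,
                    max_pages: int = 1000) -> List[str]:
    """Return pages in the order a breadth-first crawl would visit them."""
    queue = deque([seed])   # first-in-first-out frontier of pages to visit
    seen = {seed}           # never enqueue the same page twice
    order: List[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)   # "download" the page
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Toy graph: each key links to the pages in its list.
links = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],   # the link back to "a" is ignored because "a" was already seen
    "d": ["e"],
}
print(bfs_crawl_order(links, "a"))  # ['a', 'b', 'c', 'd', 'e']
```

The FIFO queue is what makes this breadth-first: everything one link away from the seed is visited before anything two links away. Swapping the queue for a stack would turn it into DFS.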
CANDIDATE: OK, let me start with something simple and high-level, and then we can start refining it.
INTERVIEWER: Yup, that makes sense.
CANDIDATE: So we need to start with a list of URLs to crawl. We have to start somewhere. Way back at the beginning of the web, webmasters would submit their domains directly to search engines so they would be crawled, so I would guess that’s what seeded this, along with the sitemaps on those sites. Even today, people can submit sites via Google webmaster tools, right? So there is some process to directly add new URLs that have no inbound links at all yet into this list of URLs to crawl.
INTERVIEWER: That could be a pretty big list.
CANDIDATE: Yeah, it’s not going to fit in memory on a single host or anything like that. We’ll probably need to hash each URL as it comes in, and dispatch it to a list on one of many servers to scale that up.
INTERVIEWER: OK. We’ll dig into that more deeply if we have time. Staying high level for now.
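The "hash each URL and dispatch it to one of many servers" idea can be sketched in a few lines. The function name and the choice of SHA-256 are my assumptions; any stable hash would do, but Python's built-in `hash()` would not, because it is randomized per process for strings.

```python
import hashlib


def shard_for(url: str, num_shards: int) -> int:
    """Map a URL to one of num_shards frontier servers, deterministically."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


urls = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.org/",
]
for url in urls:
    print(url, "->", shard_for(url, 16))
```

One caveat with plain modulo: changing the number of servers remaps most URLs to a different shard. Consistent hashing is the usual way to soften that, and it's a good follow-up to mention if the interviewer digs in.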
CANDIDATE: So then we’ll have another distributed system of some sort that actually downloads all of those URLs, and stores their contents into some truly massive distributed storage solution. I guess some sort of simple object store will do where the key is just the URL, and the value is the stuff that was downloaded. So something like Google Cloud Storage should fit the bill, or if Amazon were getting into the search engine business, Amazon S3 would do for that. Designing a distributed storage system is a whole other design problem, so again, I’ll stick with the high level here.
Next, we need to extract all of the links within that page and crawl them in turn. BFS as we said before. I imagine that’s easier said than done; there needs to be some way of normalizing those URLs. There’s the whole http vs. https thing, relative links, trailing slashes, and all sorts of edge cases we’ll need to handle. But in the end, we need some canonical URL that we can resubmit to the crawler.
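Here is a rough sketch of what that normalization step might look like using Python's standard `urllib.parse` module. The specific policies are assumptions made for illustration, not rules from the interview: treating http and https as the same page and preferring https, stripping trailing slashes, and dropping fragments. A production crawler would have many more cases to handle (IPv6 hosts, for one, are not handled here).

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(page_url: str, href: str) -> Optional[str]:
    """Resolve a link found on page_url into one canonical form, or None to skip it."""
    absolute = urljoin(page_url, href)          # handles relative links
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:             # skip mailto:, javascript:, etc.
        return None
    host = (parts.hostname or "").lower()
    if not host:
        return None
    try:
        port = parts.port
    except ValueError:                          # malformed port number
        return None
    if port is None or port == DEFAULT_PORTS[scheme]:
        # Assumed policy: http and https on default ports are the same page.
        scheme, netloc = "https", host
    else:
        netloc = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"        # assumed policy: no trailing slash
    return urlunsplit((scheme, netloc, path, parts.query, ""))  # drop the fragment


print(canonicalize("http://Example.com/a/b.html", "../c/"))   # https://example.com/c
print(canonicalize("https://example.com/", "/c#top"))         # https://example.com/c
print(canonicalize("https://example.com/", "mailto:me@example.com"))  # None
```

The two different links in the usage example collapse to the same canonical URL, which is what keeps the crawler from queueing the same page twice under different spellings.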
There are also links we might want to explicitly exclude; known malware sites, people hosting prohibited content, and stuff like that. So some sort of filtering will probably also be needed before we decide to crawl down any given rabbit hole on the Internet.
So, if a URL makes it all the way through this, it goes back into the distributed list of stuff that needs to be crawled. Specifically, that will be a first-in-first-out queue sort of thing; a big distributed linked list would do fine.
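To make the URL normalization step more concrete, here is a small sketch using Python’s standard `urllib.parse` module. The function name `canonicalize` and the specific policies (dropping fragments, stripping trailing slashes, leaving http and https distinct) are assumptions chosen for illustration; a real crawler would need to decide each of those deliberately.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(base_url: str, link: str) -> Optional[str]:
    """Resolve a link found on base_url into a canonical absolute URL.

    Returns None for links we would not crawl (mailto:, javascript:, etc.).
    """
    absolute = urljoin(base_url, link)  # resolves relative links
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        return None
    host = (parts.hostname or "").lower()  # hostnames are case-insensitive
    if not host:
        return None
    try:
        port = parts.port
    except ValueError:  # malformed port number
        return None
    # Drop the port when it is the default for the scheme.
    netloc = host if port in (None, DEFAULT_PORTS[scheme]) else f"{host}:{port}"
    path = parts.path or "/"
    # Assumed policy: treat "/a/" and "/a" as the same page.
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

Note that this sketch deliberately does not merge http and https versions of a URL; whether those are the same page is a policy question, not something the parser can answer.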
INTERVIEWER: Why a linked list and not an array?
CANDIDATE: Well, these URLs are strings, and we don’t really know ahead of time how much memory a certain number of URLs will take. Using arrays means we have to pre-allocate space, but we can’t know how many elements will fit on a given server.
INTERVIEWER: Well, you could have an array of pointers to strings, right?
CANDIDATE: That doesn’t really help; you still have to know how many strings you can fit in memory, and we don’t.
INTERVIEWER: Yeah, you’re right. So, is this list really just in memory? What happens when one of your servers goes up in flames? Do we just lose that part of the Internet?
CANDIDATE: Well, arguably, that might be OK – the next time we run the crawler it would pick it up. The simplicity and lower cost might be a reasonable trade-off there.
INTERVIEWER: Let’s say it isn’t; too many people will freak out if their new web page isn’t crawled quickly. How would you solve that?
CANDIDATE: Hm, we need some sort of distributed, persistent list. I guess you could back it on disk in a distributed database of some sort, but maybe you could just have hot standbys for each server that handles a given bucket in your URL hashing, so if one goes down you have another ready. As long as they are in different data centers, the risk should be low. Or you could do some hybrid thing between the two ideas.
INTERVIEWER: Good thinking. We don’t really have time to get into the details of that, but you’re on the right track.
One thing we didn’t talk about is the problem of duplicate content. How would you avoid processing copies of the same page that are under different URLs?
CANDIDATE: Hm, well, we could compute some sort of hash or checksum or something on the content after it’s downloaded. Then store every hash value we’ve encountered somewhere. So, before we move from the downloader to the URL extractor, we see if that page’s hash value has been seen before. If not, we add it and move on. If so, we’d have to compare the two pages character by character to ensure it’s not just some random hash collision and they really are identical – so we’d also have to store the URL the hash value came from so we can retrieve it if need be.
We have a similar problem with duplicate URLs, don’t we? If many pages are linked to the same URL, we don’t want to crawl that URL every time it’s linked. Only once will do, right? So let’s also keep a database – distributed, of course – of URLs we’ve already processed in this run. The URL filter will also check against that to ensure we haven’t already submitted that URL to the crawler. Or maybe we could do something clever in the URL queue to ensure we don’t queue the same URL twice. That could include a hash map in addition to the queue to let us check against URLs that have been processed already. But that’s another big distributed system to bolt on there when an off-the-shelf NoSQL database sort of thing would also fit the bill.
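The duplicate-content check described above might look something like this sketch. The class name `ContentDeduper` is hypothetical, and the in-memory dictionaries are stand-ins for what would really be a distributed hash store and the object store of downloaded pages.

```python
import hashlib


class ContentDeduper:
    """Detects pages whose content was already seen under another URL."""

    def __init__(self):
        # fingerprint -> URL of the first page seen with that fingerprint
        self._first_url = {}
        # Stand-in for the object store: URL -> downloaded page body
        self._store = {}

    def is_duplicate(self, url: str, body: bytes) -> bool:
        fingerprint = hashlib.sha256(body).hexdigest()
        seen_url = self._first_url.get(fingerprint)
        if seen_url is None:
            # First time we have seen this content: record it and move on.
            self._first_url[fingerprint] = url
            self._store[url] = body
            return False
        # Same hash value: compare the actual bytes to rule out a collision.
        if self._store[seen_url] == body:
            return True
        # Hash collision with different content: keep the page, but note that
        # this sketch does not index it under the shared fingerprint.
        self._store[url] = body
        return False
```

For example, calling `is_duplicate` with the same body under two different URLs returns `False` the first time and `True` the second. With a cryptographic hash like SHA-256 an accidental collision is extremely unlikely, so the byte-for-byte comparison matters mostly if a cheaper checksum is used instead.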
INTERVIEWER: Another thing we didn’t talk about yet is how to avoid bringing sites down by crawling them too fast. A lot of web servers can’t keep up with us if we just hit them with a request for every page on the site all at once. How would you deal with that?
CANDIDATE: Well, some sort of time delay has to be baked in between calls to any given site.
INTERVIEWER: Right, how would you do that?
CANDIDATE: We didn’t really go into detail on the “page downloader” block there, so let’s think that one through. Obviously, that’s going to be running on a huge fleet of servers, each running a bunch of threads to download pages, hash them, and store them. So maybe we hash the URLs to download across individual servers like we did for the queue. And we do this hashed on the domain name, so all the download requests for a given site end up on the same server. That server could then maintain a thread for each site that runs in parallel with the other sites it’s taking care of, with a time delay between each hit on a given site. This is all starting to seem a little overly complex. Maybe this whole thing could be combined with the queue somehow, so we don’t need two different systems. I don’t think we have time to go back and revisit that, though.
INTERVIEWER: No, not really. But you’re right; it is possible to just bake this logic into the queue. Then the page downloader, as you’re calling it, just has some fixed number of download threads, with a time delay between each hit, that the queue feeds requests into. The queue just makes sure requests from the same site end up in the same download thread. I like that you’re aiming for simplicity.
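One way to picture baking the politeness logic into the queue is the sketch below. It is a single-process illustration under several assumptions: the class name `PoliteFrontier` is made up, each “lane” stands in for one download thread, and the delay is tracked per host (a slight variation on a fixed delay per thread). The clock is passed in as `now` so the logic can be exercised without actually sleeping.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """URL queue that routes each site to one lane and rate-limits per host."""

    def __init__(self, num_lanes: int, delay_seconds: float):
        self.lanes = [deque() for _ in range(num_lanes)]
        self.delay = delay_seconds
        self.seen = set()        # URLs already queued in this run
        self.next_allowed = {}   # host -> earliest time of the next request

    def _lane_for(self, host: str) -> int:
        # Stable hash on the host, so one site always maps to the same lane.
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.lanes)

    def enqueue(self, url: str) -> bool:
        """Queue a URL unless it was already queued; returns True if added."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).hostname or ""
        self.lanes[self._lane_for(host)].append(url)
        return True

    def dequeue(self, lane: int, now: float):
        """Return the next URL in this lane whose host may be hit at `now`."""
        queue = self.lanes[lane]
        for _ in range(len(queue)):
            url = queue.popleft()
            host = urlsplit(url).hostname or ""
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.delay
                return url
            queue.append(url)  # host is still cooling down; rotate to the back
        return None
```

The same structure also covers the duplicate-URL concern raised earlier, since `enqueue` refuses a URL it has already seen. At real scale the `seen` set and the lanes would each be distributed, which is exactly the extra system the candidate was wary of bolting on.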
CANDIDATE: Wow, this is all more challenging than it seems at first.
INTERVIEWER: I know! That’s why it’s a good interview question. Let’s go back to your high-level design real quick. We did talk about extending this system to store images or do client-side rendering. Where would that fit in potentially?
CANDIDATE: Well, we could extract images at the same time we do URL extraction. But really, we could just treat them like another URL to be crawled; that way, we benefit from all the other pieces of the system. So the “page downloader” just knows how to recognize an image URL, and how to retrieve and store images as well as HTML.
INTERVIEWER: And client-side rendering?
CANDIDATE: I think that would have to go into the URL extraction piece. So instead of just scanning HTML for URLs, we actually render the HTML in a browser and see if any new URLs are created in the process. That means building out a whole other fleet of page renderers and a way to queue them up. Wow, this gets really complicated really fast.
INTERVIEWER: That’s why Google is as big as it is. We didn’t even talk about dynamic content or sites that require you to log in, or malicious sites that try to trap crawlers in an infinite loop. There are all sorts of interesting edge cases. But you’ve done a good job of thinking through this problem in the time we have; let’s move on.
Now that you’ve walked through several mock interview questions, go practice on your own. How would you answer these questions?
Keep in mind the structure here, and the importance of asking clarifying questions, and explaining your thought process out loud so your interviewer can understand how you think and process information.
In this article, we will be sharing some basic DO’S and DON’TS for the System Design Interview. These basic tips will help you master your interview and impress your future employer.
What They Want:
What are hiring managers really looking for in terms of perseverance?
Let’s dive a little deeper into that:
1. What they want to see is evidence of independent thought.
Can you research solutions to new problems on your own?
Can you invent things?
Can you come up with new, novel solutions to new problems that nobody’s ever seen before?
Have stories ready to go to prove that.
2. Can you learn things independently?
There is nothing more annoying than a new hire who demands help on everything that they’re asked to do when they could just look it up on the internet. If you are faced with a new technology to learn, can you just go learn it on your own? Have some evidence of a time you were faced with having to learn a new technology, and you just dove in and learned how to do it, as well as how you applied that knowledge to build something. That’s what future employers want to see.
3. Demonstrate grit:
“Never give up, never surrender,” to quote Galaxy Quest. Do you have the grit to see challenging problems through to completion? Employers love to hear stories about how you were faced with learning a new technology and solving a new problem. And not only did you learn it, but you applied it, and you deployed a system that worked and solved a real-world problem. Stories like that will be especially powerful.
4. Are you self-motivated?
You should not have to be told that you cannot just spend your whole day watching cat videos because your boss didn’t give you specific instructions of what to do today.
If you don’t have specific instructions for what to do today, you should be asking your project manager or your manager, “what should I be doing today?” And if they don’t have an answer, then you should come up with something new to do on your own that will bring value to the company. Experiment with some new idea you have and make it happen and see how it performs. Those are great stories to have: stories of pure initiative, where you had an idea of your own, and in your own spare time, you made a prototype and experimented with it to see how it worked.
It would be a really happy ending if that thing made it to production in the end, but it doesn’t have to. Just the story of self-motivation, where you had some extra time on your hands and made the best possible use of that time, is powerful. Hiring managers love that sort of thing. If you have a story about that kind of individual initiative, find an excuse to talk about it, because it will really endear you to your future manager.
What They Don’t Want:
One thing people don’t want on their team is the guy who’s constantly burdening the rest of their team with simple questions that they could have answered on their own with a little research. If anyone ever told you, “Let me Google that for you,” you have a problem. You can’t be someone who’s constantly leaning on others for basic guidance.
I run into a ton of people as students who are looking for recipes, step-by-step instructions, hand-holding, and explicit guidance on how to solve every problem they’re given. Don’t be that guy. That is not the kind of person that these big tech companies want to hire. They want people who will have the determination and perseverance to find those solutions on their own. If the answers you need are on the internet somewhere, you need to go find them yourself and not burden the rest of your team with finding it for you. You need to be as self-reliant as possible.
If you’re being asked to design some big new system, of course, you should be collaborating with your team on that. But for the simple stuff, look it up on your own.
If you’re the kind of person who can’t accomplish anything without a step-by-step recipe, you need to work on that. It’s a sign of experience that you don’t need recipes to get things done, that you can put things together on your own, and can assemble different technology components to create new things.
So, don’t talk about a time when you had step-by-step instructions to do something. Talk about times when you figured it out yourself.
Hiring managers also watch out for people that have a failure to focus.
You must appreciate that the work you do has zero value until it’s in front of customers. That understanding can provide a strong drive to get stuff done. Have stories ready of where something you built made it all the way into production, and you played a role in pushing it out the door and making sure that it had a real impact on the business.
If you spend the whole year just doing R&D and trying out cool new ideas because you think it’s fun technology when you were supposed to be building customer-facing systems, good for you, but that does your company no good. That does your manager no good. In fact, it does some harm because you’re wasting resources that could have been better spent.
Have stories prepared that show you have a focus on the result, and you realize that you need to work hard to get something out the door. And until it’s out the door, it has zero value to the business. Those are good things to talk about and demonstrate in your interviews.
Hopefully, this insight gave you some technical knowledge on what your interviewers are looking for in your interview. To learn more strategies like these on how to Master your System Design Interview, we’d like to invite you to check out our Mastering the Systems Design Interview Course.
In this course, Frank Kane, a former hiring manager at Amazon headquarters, shares a behind-the-scenes look at what interviewers are looking for, and how you can stand out from the crowd so you can land your dream job.
Have you been asking yourself any of these questions lately?
Why do our technical projects keep slipping?
Why are the engineers I work with annoyed when I try to talk to them?
Why are they resistant to coming back into the office?
Why can’t they appreciate the strategic importance of what we’re building?
If you answered yes – consider that the problem might not be with your engineers but with how you communicate with them. Managers, project managers, or anyone who depends on technical teams need to understand how engineers think differently – and how to communicate with engineers to maximize their productivity and their morale.
This course is taught by Frank Kane, who brings his experience as both a senior manager and a senior engineer at Amazon headquarters. Frank’s seen the challenges of communication between engineers and non-engineers from both sides of the table and shares his insights on how to empathize with engineers to communicate more effectively. You’ll join 700,000 learners who have gained technical and managerial skills from Frank.
Better communication with engineering leads to more realistic project schedules, a more productive team, and an assurance your team is building the right thing. Some specific topics we’ll cover include:
Introversion vs. extraversion, and how to create an environment conducive to both
Communication challenges arising from a focus on the big picture vs. a focus on technical details
Optimizing your communication style to keep engineers productive
Soft skills vs. hard skills, and the communication challenges that arise
Navigating cultural, language, and geographic barriers
You’ll also get four hands-on activities, including a role-playing exercise of a difficult meeting with a lead engineer. You’ll get to practice and apply what you learn.
This course is aimed at non-technical staff that depends on engineering teams to deliver results – managers, project managers, or anyone else on the business side. Understanding what makes engineers tick goes a long way in building a more productive working relationship with them.
Mastering the System Design Interview: Asking the Right Questions (Part 1)
By Frank Kane
This article is an excerpt from our Mastering the System Design Interview course. In this section, we will dive into the key questions you should ask during your interview to help you stand out from the crowd.
When it comes to an interview, your ability to navigate the interview is just as important as your skills and credentials.
Your potential employer is not only evaluating your technical expertise but also how you approach problems in general and how you work with others. That is equally important to the technology that you’re invoking.
So, let’s discuss strategies for getting through the interview itself successfully. These are more about soft skills and how you approach problems than the actual technologies themselves. As an interviewer at Amazon, where I did thousands of these interviews, this is what I’m looking for more than anything else.
I want you to come up with a solution that makes sense, but it’s less important to me that you finish and describe a fully fleshed-out system that scales. I’m more interested in whether you can take feedback from me.
How do you respond to my direction?
How well do you work with me?
How do you think?
How do you approach problems?
Usually, in a system design interview question, you’re going to be given some incredibly vague problem to solve – and that is intentional. The interviewer wants you to break that down and clarify the requirements of what they’re asking you to build.
I might give you some incredibly high-level things, like “design YouTube for me”, or “design Google search please”. Some people will just sit there and look like a deer in headlights and quietly cry to themselves a little bit (they don’t really, at least I hope not).
The successful interviewee will say, “Okay, let’s break that down and see what you really want me to do.” The first step is to turn this vague direction they’ve given you into concrete requirements.
1. Always start by repeating the question. That’s just a basic communication skill. If someone’s asking you to do something, just repeat it back to them and make sure that you understand it properly.
So, if I say, “design YouTube”, the first thing you should say is, “okay, you want me to design YouTube, you know, the big video streaming service” and make sure you’re on the same page.
2. Now, it’s time to ask lots of questions.
Break that down into what they really want you to build.
Clarify what the requirements for it are.
3. Think out loud.
As you approach a problem, don’t just sit there in silence. That’s not doing you any good. Let your interviewer see your thought process.
As you’re thinking about different strategies and the pros and cons of them, don’t keep it inside your head where the interviewer can’t see it. Talk about it as you’re thinking through it. That will give the interviewer a chance to steer you in the right direction, which is going to help you.
Learn more strategies on how to Master your System Design Interview today with our Mastering the Systems Design Interview Course.
Stay tuned for part two of this series, coming soon.
Ace your System Interview with these Strategies – Part 2
By Frank Kane
This article is an excerpt from our course, Mastering the Systems Design Interview. As part 2 of our series, this article will share strategies & skills to help you master your Systems Design Interview. In part 1, we went through how to ask the right questions to stand out in the interview.
Now let’s dive in. First and foremost, you want to let your interviewer see your thought process. As you’re thinking about different strategies and the pros and cons of them, don’t keep it inside your head where the interviewer can’t see it. Talk about it as you’re thinking through it. That will give the interviewer a chance to steer you in the right direction, which is only going to help you.
Start by repeating the question and then break it down into specific requirements.
If I say “design YouTube,” you should say, “Okay, you want me to design YouTube. What part of YouTube do you want me to design? There are many components to YouTube, like recommendations, editing content, channels, advertisements, and managing payments to people.
What piece do you want me to focus on?
Is it just the storing and the serving of the videos themselves?”
Odds are they’ll say, “yeah, let’s just focus on that” because obviously, you’re not going to design all of YouTube in 20 minutes.
So, narrow the problem down to what they really want you to do. Then ask more clarifying questions. How many videos are we talking about? How much traffic are we talking about?
They might push back on you and say, “well, what do you think it is?” You might have to estimate that yourself. But just make sure you come to an agreement about what the scale of the problem really is, and what the requirements are in terms of latencies or availability. Those requirements will inform what tradeoffs you can make in your design.
You might also ask about the budget. Do I have infinite money and infinite servers to throw at this? Or do I have cost constraints to think about as well? Clarify those requirements upfront. As you start to think about what the implications of those requirements are, talk about it.
I’ve seen one other system design interview prep course, and the instructor’s advice was to start off by saying, “okay, can I think about this for five minutes?” I think that’s a terrible idea. Don’t just sit there in silence for five minutes. It’s awkward, and the interviewer doesn’t get any insight into your thought process, which is what they really want to see. So, no, do not just sit there in silence for five minutes while you think about it. Think out loud, and let the interviewer see how you approach the problem.
Once you understand what those requirements are, a good idea is to work backward from the customer experience. It’s very tempting to start with your favorite technology and say, “oh, okay, I want to use Apache Spark and a deep neural network for this problem,” and work forwards from your favorite technologies in hopes that you end up with a solution that meets the customer requirements.
A better approach is to work backward from those customer requirements and figure out what technologies meet those requirements. For example, say we’re asked to design YouTube. You need to vend massive amounts of videos all around the world at very low latency and at a massive scale. That probably means you need to use a CDN for the most popular videos at least. Where does the data come from that feeds that CDN, and how is that data cataloged? How do I access it? Work backward from the end user to the CDN, to the distributed data store that feeds that CDN, to the systems that populate that data store. That’s working backward from the customer, as opposed to working forwards from the lowest level technology and hoping you arrive at the right place.
Working backward will gain you major points at Amazon in particular because it is a very core piece of their entire culture. At Amazon, you are routinely expected to work backward from the customer, and you are evaluated on how well you do it. Simply saying during an interview, “I’m going to work backward from the customer experience here,” will earn you a ton of goodwill at an Amazon system design interview. But even outside of Amazon, working backward is the right way to do it. Most experienced interviewers will appreciate this approach.
Tying it back to our YouTube example again, we might start off by asking clarifying questions. How will users discover the videos? Do we need to think about building a search engine or a recommender engine or an advertising engine? Figure out what the customer experience is that you’re being asked to deliver. What piece of that experience are you being asked to design? Start by understanding what customer experience you’re being asked to deliver, and then work backward from that customer experience to the technical components that you need to deliver it.
You must first identify who the customers are.
Are these people all around the world?
Are they in one specific region primarily?
Are they accessing your data at weird times, or are there weird peaks in traffic that we need to worry about?
What are their use cases?
What are they trying to do?
In our YouTube example, they’re trying to find a video and play it back in a reliable manner. They don’t want to see any buffering.
Which use cases do you need to concern yourself with? Don’t paint yourself into a corner where you’re trying to design all of YouTube in 20 minutes or all of Google or all of Amazon. Make sure you define upfront what piece of it they really want you to do.
Your first task in any system design interview is to clarify the requirements of what you’re being asked to design. Clarify them in terms of the customer experience whenever possible. Your interviewer is really trying to see that you can think about problems from a business perspective and not just a technical perspective. If you start with the customer experience and focus on that, that shows me that you can think about this like a business owner would. You’re thinking about how this technology will provide an experience to this company’s customers that can make them money down the road. Your job is not just about applying cool technology. It’s about finding solutions for delivering new compelling customer experiences.
Once you’ve defined the customer experience you’re trying to deliver, what do you ask about next?
You need to define the scaling requirements of the system you’re being asked to deliver. Nail down what the scale of the system really is. Are we talking about hundreds of users or millions of users? That makes a big difference as to what sort of technologies you will invoke. And it will inform you of the need for horizontal partitioning. If the scale is massive, then you know you need some sort of horizontally scaled solution. You can’t just throw one giant database at it or one giant web server. You need some sort of distributed system. And if it’s truly massive, maybe you need to rely on cloud storage solutions like S3, where somebody else is managing all that cost and complexity for you.
Also, understand how often the users are coming. What kind of transaction rates are we talking about that we need to support? Again, if it’s large rates of transactions, you will need many servers at the front end to manage those transactions. Think about how those transactions are distributed across different data centers and across the world.
Also, define the scale of the data. That data must come from somewhere, and it probably needs to be processed in some way. Are we talking about hundreds of videos, millions of videos, or billions of videos? If it’s a really big number, we need some sort of distributed system for processing and storing that data. You will need to employ every trick in the book for horizontally scaling your servers, your data storage, and your data processing.
But not every problem warrants that sort of approach. If you’re being asked to design some internal tool only used by some department within your company, you don’t need that level of complexity and expense. You should always prefer the simplest solution that will meet the requirements. But generally, if you’re being asked to design a system for a large company, they’re asking you to design something at a massive scale. Often, the answer will be some sort of horizontally scaled solution, but at least articulate that, ‘Hey, this is massive scale. I need to do horizontal scaling’.
On the off chance that you’re being asked to design something smaller, go for simplicity and explicitly say that’s what you’re doing. Say, “hey, I know we could throw a big Hadoop cluster at this problem or some big distributed Elasticsearch cluster, but I don’t need to. I don’t want to have to maintain all those services if I don’t need to. And I don’t want to be paying a big web services bill either, so why take on that complexity when I can just have a database and a backup database and a few web servers in front of it and be done with it?”
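To make the scaling questions above concrete, a quick back-of-envelope calculation can help you decide, out loud, whether a problem needs horizontal scaling at all. The sketch below is only an illustration: the function name `estimate_scale`, the replication factor of 3, and the simplification that every view streams the whole video are all assumptions, and the inputs should be whatever numbers you and your interviewer agree on.

```python
def estimate_scale(num_videos, avg_video_mb, views_per_day, replicas=3):
    """Rough storage and traffic estimate for a video service.

    Assumes every view streams the full video and that each video is
    stored `replicas` times; both are simplifications for estimation.
    """
    # Total storage in terabytes (1 TB = 1,000,000 MB).
    storage_tb = num_videos * avg_video_mb * replicas / 1_000_000
    # Average request rate, ignoring traffic peaks.
    views_per_second = views_per_day / 86_400
    # Average outbound bandwidth in gigabits per second.
    egress_gbps = views_per_second * avg_video_mb * 8 / 1_000
    return {
        "storage_tb": storage_tb,
        "views_per_second": views_per_second,
        "egress_gbps": egress_gbps,
    }
```

If the result is a few gigabytes and a handful of requests per second, a database, a backup, and a few web servers may be all you need. If it comes out in the hundreds of terabytes or hundreds of gigabits per second, that is your cue to talk about distributed storage and a CDN.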
Learn more strategies on how to Master your System Design Interview today with our Mastering the Systems Design Interview Course.
In this course, you will get tips, tricks, and practice interviews with a former hiring manager from Amazon, who interviewed thousands of software engineers and hired hundreds.
Frank Kane will share the secrets of what your interviewer is looking for and the technologies you’re expected to know. And we all know practice makes perfect, so you’ll also get six mock system design interviews with real-world interview questions from the biggest tech employers.
A technical interview loop is a demanding process, and the system design part is often the most challenging. This course gets you prepared, and maximizes your odds of landing a new job that could change your life.
This course includes 5 hours of on-demand video content that will cover everything you need to know before starting your next interview:
Techniques for scaling distributed systems and service fleets
Database technologies and “NoSQL” solutions
Use of caching to improve scalability and performance
Designing for resiliency and handling failures
Distributed storage solutions
A review of algorithms and data structures
Processing big data with Apache Spark
An overview of cloud computing resources
Interview strategies for structuring your system design interview
Six full mock interviews with real-world system design interview questions
General tips and tricks for a successful technical interview
This course is for experienced software engineers who need some extra preparation prior to a challenging system design interview.
This course is currently available at sundog-education.com and will also be available on Udemy soon, but you’ll only be able to get it at its best price of $9.99 at sundog-education.com (now through the end of the month)!
Enroll today, and you’ll have every advantage going into your next tech interview!
This week we’ve rounded up some of the latest news in big data and machine learning for you, including the difference between mathematical optimization and machine learning, and man-machine collaboration.
As an instructor on Udemy, I’m always looking for new ways to help my students learn about big data and machine learning. If you’re interested, check out my latest post on Udemy’s blog: Machine Learning Engineer vs Data Scientist – https://bit.ly/399v2PC