All of us are familiar with coding and aptitude rounds in interviews, but in many companies, there is another part called the system design round. Have you ever heard of it?
In most companies, engineers usually work on smaller projects and never get to design large systems. FAANG companies, on the other hand, require their engineers to be able to create systems at scale.
We all dream of working in FAANG, so we should be as well-versed with system design as we are with other topics.
Now the question arises, what is system design?
To understand this, let us divide the term into two parts: system and design. In simple terminology, a system is a component that, given an input, produces an output, and a design is a plan that shows how some function works.
On clubbing these two parts together, we get,
System design involves developing the modules, architecture, components, interfaces, and data that together provide a solution meeting the requirements of an organisation.
After knowing this, you may be wondering what kind of systems are asked to be designed in an interview.
You can find a list of popular system design interview questions here.
Now, in this article, we will learn about the URL Shortening Service in particular.
Requirements to Design a URL Shortening Service
To begin with, we should know what our requirements are while making a URL shortening service.
- Our first and foremost requirement is to generate a short, unique version of the long URL given to us. The new URL must be short since that is the main purpose of our system, and it must be unique so that it leads to our desired web page and not some other one.
An example of this would be shortening this link:
- To understand our next requirement, try clicking on both the links given above.
Don’t they lead you to the same webpage?
That’s the next requirement of our URL shortening service, i.e., the shortened link must redirect us to the original link.
- Now, when you clicked both the links, opening the shortened link should have taken about the same amount of time as opening the original one. This illustrates our third requirement: redirection in real time.
- For our next requirement, let us look at the original and shortened links again.
It is evident from the original link what the webpage is about, but is it so for the shortened link too?
With this realisation, we come to our fourth requirement that shortened links should not be predictable.
- Another critical requirement for a URL shortening service is that the system should be available almost all the time. If it isn’t, then all the URL redirections will fail.
- Some other requirements are:
- Users should be provided with the choice to pick a custom short link.
- The shortened link should expire after a default time specified by the user.
- Analytics such as the number of redirections should be kept track of.
- The URL shortening service should be accessible to other services through REST APIs.
Capacity Estimation and Constraints in URL Shortening Service
Before designing a system, we need to estimate parameters like the bandwidth required for redirections or the total memory needed to store the shortened links. Let us make those estimations first.
In all the estimations given below, we will consider a ratio called the read-write ratio.
This ratio provides us with a comparison between the number of redirections and the number of URLs shortened.
In our example, let us consider it to be 100:1.
To calculate the traffic estimate, we will assume that there will be 10 million new URL shortenings each month. By the read-write ratio, the number of redirections per month will therefore be,
100 × (10 × 10^6) = 10^9 = 1 billion
Next, we will calculate the queries per second (QPS) or, in simpler terms, the number of new URL shortenings per second,
(10 × 10^6) / (30 days × 24 hours × 3600 s) ≈ 4 URLs/s
From the read-write ratio, the number of redirections per second will then be,
100 × 4 = 400 redirections/s
Storage is an essential factor while designing a URL shortener algorithm.
To estimate the storage requirements, let us consider that the time after which a shortened URL expires is five years and that each stored object needs 500 bytes.
Earlier, we assumed that 10 million new URLs are generated every month. So the total number of objects that must be stored over 5 years is,
(10 × 10^6) × 12 months × 5 years = 6 × 10^8 = 0.6 billion
From here, we can calculate the total storage requirements for five years to be,
0.6 × 10^9 × 500 bytes = 3 × 10^11 bytes = 300 GB
Since we already estimated the storage requirements, you may be wondering what memory estimates are.
When particular URLs receive very high traffic, they are kept in cache memory (since the cache is faster), following the 80-20 rule. According to this rule, 20% of the URLs generate 80% of the traffic, so that 20% is cached. This is what the memory estimate covers.
Considering our previous values, the number of redirections per day will be,
400 × 24 hours × 3600 s = 34.56 × 10^6 ≈ 34.56 million
Therefore the cache memory required will be,
20% of (34.56 × 10^6 × 500 bytes) = 20% of 17.28 GB ≈ 3.5 GB
This amount is just an estimate, and the actual amount of memory required will be less than this. The reason for that is that there are many duplicate requests for the same URL.
The bandwidth requirements are also a vital factor, owing to the limits on available bandwidth.
For write requests, we estimated the QPS to be 4 URLs / s. Thus, the total incoming data will be,
4 × 500 bytes = 2000 bytes = 2 KB/s
For read requests, the number of redirections per second is 400. Thus, the total outgoing data will be,
400 × 500 bytes = 200,000 bytes = 0.2 MB/s
To summarise all our estimates,
|Estimate|Value|
|---|---|
|Incoming data|2 KB/s|
|Outgoing data|0.2 MB/s|
|Storage for 5 years|300 GB|
|Memory for cache|~3.5 GB|
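These back-of-the-envelope figures can be reproduced with a few lines of arithmetic. The sketch below simply recomputes the estimates above from the stated assumptions (10 million new URLs per month, a 100:1 read-write ratio, 500 bytes per object, 5-year retention):

```python
# Recomputing the capacity estimates from the article's assumptions.
new_urls_per_month = 10 * 10**6   # 10 million new shortenings / month
read_write_ratio = 100            # 100 redirections per shortening
bytes_per_object = 500            # storage per shortened URL

# Traffic
seconds_per_month = 30 * 24 * 3600
write_qps = round(new_urls_per_month / seconds_per_month)  # ~4 URLs/s
read_qps = write_qps * read_write_ratio                    # ~400 redirections/s

# Storage over five years
total_objects = new_urls_per_month * 12 * 5                # 0.6 billion
storage_bytes = total_objects * bytes_per_object           # 300 GB

# Cache memory via the 80-20 rule: cache 20% of a day's traffic
redirects_per_day = read_qps * 24 * 3600                   # 34.56 million
cache_bytes = 0.2 * redirects_per_day * bytes_per_object   # ~3.5 GB

# Bandwidth
incoming_bytes_per_s = write_qps * bytes_per_object        # 2 KB/s
outgoing_bytes_per_s = read_qps * bytes_per_object         # 0.2 MB/s

print(write_qps, read_qps, storage_bytes, cache_bytes)
```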
We can get a rough idea of the amount of data we are dealing with from these estimates. We can choose the database and API (with which we’ll fetch our data) for our URL shortening service with that information.
An RDBMS (relational database management system) suits smaller data sets (it scales vertically), while NoSQL suits larger data sets (it scales horizontally).
Technique to Design a URL Shortening Service
Now that we know all the details we need about URL shortening, let us walk through the design process.
Writing an API
API stands for Application Programming Interface. The term sounds complex, but an API is actually easy to use and understand: it is a common language, or interface, that servers use to communicate with each other.
You utilise an API every time you use an app like Facebook, send an instant message, or check the weather on your phone.
In simple terms, an API is to a server what a function is to a regular program.
We will be using a REST API due to its security options and its ability to support a scalable URL shortener.
But what is it doing?
An API, as already mentioned, works like a function. In a function, we pass arguments that get processed to return some result. Similarly, an API requires API parameters to send a request and get back a response.
The general syntax to send a request using API is:
API_name(API_parameter1, API_parameter2, API_parameter3)
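To make this concrete, here is a minimal sketch of the two core operations such a REST API might expose. The function names and parameters (`api_dev_key`, `custom_alias`, `expire_date`) are illustrative assumptions, not a fixed standard, and an in-memory dict stands in for the real database:

```python
# A sketch of the two core endpoints a URL shortening REST API might offer.
# Names and parameters here are assumptions for illustration.
import hashlib

_store = {}  # in-memory stand-in for the database

def create_url(api_dev_key, original_url, custom_alias=None, expire_date=None):
    """Like POST /urls: returns a short key for original_url."""
    key = custom_alias or hashlib.md5(original_url.encode()).hexdigest()[:7]
    if key in _store and _store[key] != original_url:
        return None  # the requested alias already points elsewhere
    _store[key] = original_url
    return key

def delete_url(api_dev_key, url_key):
    """Like DELETE /urls/<url_key>: removes a shortened link."""
    return _store.pop(url_key, None) is not None

short = create_url("dev-key-123", "https://www.codingninjas.com/blog/some-post")
print(short, "->", _store[short])
```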
Database Schema
In the storage estimation, we already saw that we choose a database according to the amount of data we are dealing with. Once that is decided, we need to chalk out the database schema.
|URL table|User table|
|---|---|
|Hash: varchar(16) (primary key)|User ID: int (primary key)|
|Original URL: varchar(512)|Password: varchar(32)|
|Creation date: datetime|Date of birth: datetime|
|Expiration date: datetime|Email ID: varchar(64)|
|Username: varchar(32)|Username: varchar(32)|
|User ID: int||
Here, the User table stores the data about each registered user, and the URL table stores the data for each shortened URL. The User ID is used to map (via a hash table) each shortened URL to the user who created it.
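The two tables above can be sketched with Python's built-in sqlite3 module. The column names follow the schema in the text; the choice of SQLite (rather than the NoSQL store our estimates might favour) is purely for illustration:

```python
# The URL and User tables from the schema, created in an in-memory SQLite DB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id       INTEGER PRIMARY KEY,
    username      VARCHAR(32),
    password      VARCHAR(32),
    email         VARCHAR(64),
    date_of_birth DATETIME
);
CREATE TABLE urls (
    hash            VARCHAR(16) PRIMARY KEY,   -- the shortened key
    original_url    VARCHAR(512),
    creation_date   DATETIME,
    expiration_date DATETIME,
    user_id         INTEGER REFERENCES users(user_id)
);
""")
conn.execute("INSERT INTO users (user_id, username) VALUES (1, 'ninja')")
conn.execute(
    "INSERT INTO urls VALUES ('abc123', 'https://example.com/long-page', "
    "'2024-01-01', '2029-01-01', 1)"
)
row = conn.execute(
    "SELECT original_url FROM urls WHERE hash = 'abc123'"
).fetchone()
print(row[0])  # → https://example.com/long-page
```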
Encoding the URL
Almost all of us have made a typo while typing a URL at some time or the other, which leads us to a completely different website.
That’s how sensitive URLs are.
Keeping this fact in mind while designing a URL shortening service is crucial, and that’s why we need encoding. But what is encoding?
In straightforward terms, let's consider that every capital letter of the alphabet is represented by a number from 0 to 25: A = 0, B = 1, and so on up to Z = 25.
Therefore, the word “CODINGNINJAS” will be encoded as “21438136138139018”.
This is a very simple way to encode data that is easy to decode, so it isn’t used to encode URLs.
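Just to make the toy scheme concrete, here is a one-line version of it:

```python
# The toy letter-to-number encoding from the example above.
def toy_encode(word):
    # Map 'A' -> 0, 'B' -> 1, ..., 'Z' -> 25 and concatenate the numbers.
    return "".join(str(ord(ch) - ord("A")) for ch in word)

print(toy_encode("CODINGNINJAS"))  # → 21438136138139018
```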
In reality, two standard encoding techniques for URLs are:
- MD5 (Message-Digest Algorithm) –
Here, a 128-bit hash value is generated as the encoded value. It is not preferred because of its known vulnerabilities.
- SHA256 (Secure Hash Algorithm) –
Here, a fixed-length string is generated as the encoded value.
Now, the number of possible encoded values is finite, so after encoding, two different URLs may map to the same short string (a collision). To avoid this problem, we can encode the basic URL common to all and then append the user ID (different for every user) at the end.
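A minimal sketch of that idea: hash the base URL together with the user ID so that two users shortening the same link get different keys. Truncating the SHA-256 digest to 7 characters is an assumption made here purely for illustration:

```python
# Collision avoidance: include the user ID in the hashed input.
import hashlib

def short_key(original_url, user_id, length=7):
    data = f"{original_url}:{user_id}".encode()
    return hashlib.sha256(data).hexdigest()[:length]

a = short_key("https://www.codingninjas.com/blog/post", user_id=1)
b = short_key("https://www.codingninjas.com/blog/post", user_id=2)
print(a, b, a != b)
```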
Data Partitioning and Replication
While shortening the URL of a website, we need to make sure that all the data and files associated with the original URL are linked with the shortened URL. To ensure this, the database is partitioned (in simple terms, the data is organised) based on its necessity. There are two types of partitioning:
- Range-based partitioning
Here the data is sorted by a fixed parameter, for example, lexicographically using the file name.
As a practical example, consider a class of students. You want to divide them into groups, but you’re separating them by the first letter of their name. Now, some letters like “A” and “S” are very common, while names with “Q” or “Z” are not that common. So, here the data is unevenly distributed.
With this, we can understand the drawback of range-based partitioning and the need for our next method – hashing.
Hashing involves assigning keys to values. Here, each piece of data is identified by a key, and those keys are assigned to partitions by a hash function. This removes the problem of non-uniform partitioning we encountered above.
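The idea can be sketched in a few lines: the key's hash, not its first letter, decides which partition stores it, so keys spread roughly evenly. The partition count and sample keys below are illustrative:

```python
# Hash-based partitioning: hash the key, then take it modulo the number
# of partitions to pick a server.
import hashlib

NUM_PARTITIONS = 4

def partition_for(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

partitions = {i: [] for i in range(NUM_PARTITIONS)}
for key in ["apple", "avocado", "ant", "zebra", "quail", "stone"]:
    partitions[partition_for(key)].append(key)
print(partitions)
```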
Caching
The primary use of caching is to speed up the loading of a particular website. So, if a specific URL goes viral after being shortened, the data fetched from the database via the shortened URL is also stored in the cache memory.
The LRU (Least Recently Used) policy is used to evict entries from the cache when it is full.
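A minimal LRU cache for hot short links might look like the sketch below, built on an OrderedDict; the capacity and sample entries are illustrative:

```python
# A minimal LRU cache: the least recently used entry is evicted when full.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

cache = LRUCache(2)
cache.put("abc123", "https://example.com/a")
cache.put("def456", "https://example.com/b")
cache.get("abc123")                           # touch abc123
cache.put("ghi789", "https://example.com/c")  # evicts def456
print(list(cache.data))  # → ['abc123', 'ghi789']
```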
Load Balancing
As the name suggests, a load balancer distributes requests evenly and prevents any single server from being overloaded. Overloading may occur at three places in a URL shortening service:
(i) Between the client and application servers
(ii) Between the application server and database server
(iii) Between the application server and cache server
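At any of these three points, a simple strategy such as round-robin can be used; the toy sketch below (with made-up server names) shows the idea:

```python
# A toy round-robin load balancer: requests are assigned to the
# application servers in turn so no single server takes all the load.
import itertools

servers = ["app-server-1", "app-server-2", "app-server-3"]
next_server = itertools.cycle(servers)

assignments = [next(next_server) for _ in range(6)]
print(assignments)  # each server receives exactly two of the six requests
```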
Database Cleanup (Purging)
Earlier, database sizes were limited, and the number of distinct encoded combinations was limited too, so entries had to be freed for maximum utilisation. Data was therefore given an expiry date, after which it was deleted from the database. If someone tries to open a shortened URL after its expiry, an error message is displayed.
Nowadays, database sizes are far less constrained, and advanced encoding techniques provide innumerable combinations. For this reason, most expired data is simply left unused instead of being actively deleted.
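One common way to handle expiry without a heavy background job is lazy purging: each lookup checks the expiration date and deletes the entry on the spot if it has lapsed. The store contents below are illustrative:

```python
# Lazy purging: expired links are removed when they are next accessed.
from datetime import datetime, timedelta

store = {
    "abc123": ("https://example.com/a", datetime.now() + timedelta(days=365)),
    "old999": ("https://example.com/b", datetime.now() - timedelta(days=1)),
}

def resolve(key):
    entry = store.get(key)
    if entry is None:
        return None          # unknown or already purged: show an error page
    url, expires_at = entry
    if expires_at < datetime.now():
        del store[key]       # purge the expired link lazily, on access
        return None
    return url

print(resolve("abc123"), resolve("old999"))
```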
Telemetry
In English, telemetry means "recording and transmitting the readings of an instrument".
In a URL shortening service, it means the same. To understand this, let’s see an example.
Facebook often releases data about the number of its users from around the world. India proudly holds the first position, but how does Facebook find that out?
When a request is made to the database to fetch some data, it can track from where the request is made. A count of the number of requests from a country measures the number of users from there.
This is how Facebook releases such data, and it is called telemetry.
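That kind of telemetry can be as simple as a counter per country of origin; the request list below is made-up sample data:

```python
# Telemetry sketch: count redirect requests per country of origin.
from collections import Counter

request_countries = ["IN", "US", "IN", "BR", "IN", "US"]
by_country = Counter(request_countries)
print(by_country.most_common(1))  # → [('IN', 3)]
```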
Security and Permissions
Every database has different sections with different levels of accessibility.
For example, all the employees at Google have a Gmail account for communication, but we also use Gmail. The user interface for both of us will be different since a Google employee will get admin access which we don’t get.
Now, both a Google employee and an average user can shorten the link they use to access their email. So if we open the Google employee's shortened link, shouldn't we get admin access too? Yet that doesn't happen.
Every user has a separate User ID which gives them access to specific data in the database. When a URL is shortened, this User ID is encoded within the new URL. So, when we try to open the admin page with our User ID, it doesn’t open because we haven’t been permitted to access the data.
This is how security is provided in a URL shortening service.
Frequently Asked Questions
What is a free URL shortening service?
Bitly is a powerful and popular tool for shortening URLs. Its free service lets you shorten links using the bit.ly domain name.
How do we ensure that two URLs never map to the same shortened URL?
The basic URL, which remains the same for all users, is encoded first, and then the user ID (which is different for every user) is appended at the end, so that two URLs never map to the same shortened URL.
What do we need to create a URL?
To create a URL, we need the following:
- A scheme, such as HTTP or HTTPS
- A hostname (e.g., www.systemdesignurlshorteningservice.com)
What makes a good URL?
Good URLs are short and to the point.
How is a URL minimised?
A URL is minimised by mapping a long URL in the database to a short one using proper encoding techniques.
After going through this article, we know how a URL shortener works on the backend. But so far, we have covered only the theory.
To deepen our understanding further, we should work on a shortened URL project. You can find that and other popular system design problems on CodeStudio.
Apart from that, CodeStudio also provides a wide array of practice problems and interview experiences from people presently working in renowned companies like Amazon, Microsoft, Goldman Sachs and many more.
By: Neelakshi Lahiri