This is a continuation of a blog series about common mistakes when building serverless systems on AWS. It is split into the following parts:
- Coding
- Architecture (here)
- Performance, scalability, optimization, and monitoring
- Costs
- Communication with other systems
- Security
- Unexpected Behaviors
Serverless differs from traditional systems in its architecture. Functions are the glue between other services. They are triggered by events, so this is, in essence, event-based programming. The focus is not so much on a single data store but on the flow of data moving between services. In many cases the design patterns differ from traditional ones and are more microservices oriented.
You can read about the principles of serverless design patterns here.
Do One Thing Only
You should keep your functions small and have each one do only one thing. Serverless solutions should be built as microservices: small and autonomous. That means you can change, optimize, scale, move, or transform each microservice independently. The purpose of splitting into microservices or nanoservices is to get small independent units that can be individually managed, monitored, and scaled. See more about code granularity here.
Design for Failure
As in any system, errors can occur. This is especially true for distributed systems such as serverless, where you depend on a lot of pieces. There can be server crashes, resource limits, throttling, third-party issues, network outages, versioning conflicts, or even just bad code. You should take these errors into account when you are designing a system. That does not mean serverless services are not reliable.
When designing a serverless system, design for failure. That does not mean serverless is not reliable.
You should gracefully handle errors and retry tasks if necessary. In some cases, the system retries them automatically. And here we come to a very important term: idempotency. Idempotent means that you can run the same task multiple times and get the same result, and that multiple identical requests have the same effect as a single request. Repeating a task must not have side effects. A task may be repeated because it did not complete, because of concurrency, or for other reasons. For example, repeated processing of an order must not result in shipping two products. This means you have to perform checks at the beginning of processing and before outputting results.
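A minimal sketch of such a check, assuming a hypothetical DynamoDB table named ProcessedOrders keyed on orderId. A conditional write records the order exactly once, so a retried or duplicated event cannot trigger a second shipment:

```javascript
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

// Placeholder for the real downstream work (shipping service call, etc.)
async function shipProduct(orderId) {
  console.log(`Shipping product for order ${orderId}`);
}

exports.handler = async (event) => {
  const orderId = event.orderId;
  try {
    // The conditional write succeeds only the first time this orderId is seen.
    await dynamo.put({
      TableName: 'ProcessedOrders',
      Item: { orderId, processedAt: Date.now() },
      ConditionExpression: 'attribute_not_exists(orderId)'
    }).promise();
  } catch (err) {
    if (err.code === 'ConditionalCheckFailedException') {
      // Idempotent: the task was already done, repeating it has no side effects.
      console.log(`Order ${orderId} already processed, skipping`);
      return;
    }
    throw err; // a real error, let the retry mechanism handle it
  }

  await shipProduct(orderId);
};
```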
You should take special care with accidental recursive calls and retries, which can cause an uncontrolled rise in costs and overload your system, especially the downstream systems that you connect to.
Lambda and Blocking Execution
In Lambda, only one request is processed at a time per container. In Node.js, most API calls are asynchronous, which means they do not block the thread. But because a Lambda instance processes only one request at a time, you mostly do not benefit from this great Node.js feature unless you make multiple simultaneous calls to other services. The plus side is that you can occupy the thread with processing as much as you want without affecting your Node.js application, unlike traditional environments where one thread is shared by multiple simultaneous requests.
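A sketch of where Node.js async I/O still pays off inside a handler: firing several independent calls at once instead of awaiting them one by one. The table and bucket names are illustrative placeholders.

```javascript
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();
const s3 = new AWS.S3();

exports.handler = async (event) => {
  // Sequential awaits would add the latencies together; Promise.all overlaps them,
  // so the request takes roughly as long as the slowest call.
  const [user, orders, config] = await Promise.all([
    dynamo.get({ TableName: 'Users', Key: { userId: event.userId } }).promise(),
    dynamo.query({
      TableName: 'Orders',
      KeyConditionExpression: 'userId = :u',
      ExpressionAttributeValues: { ':u': event.userId }
    }).promise(),
    s3.getObject({ Bucket: 'my-config-bucket', Key: 'config.json' }).promise()
  ]);

  return {
    user: user.Item,
    orders: orders.Items,
    config: JSON.parse(config.Body.toString())
  };
};
```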
Stateless Functions Logic
Containers that run functions come and go depending on current needs. You cannot hold state, so no state-oriented authentication or other stateful solutions will work. You can hold some data, for example a cache or a database connection, while the container is alive, but you will not be notified when it goes away. And you will never know how many containers are currently running.
The fact that a container stays alive between calls has a side effect: any overflows, leaks, or similar mistakes we make can impact subsequent Lambda invocations.
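A sketch of reusing resources across invocations of the same container. Anything declared outside the handler survives as long as the container does, but there is no guarantee how long that is or how many containers exist; the bucket and key are placeholders.

```javascript
const AWS = require('aws-sdk');

let cachedConfig = null; // lazy in-memory cache, lost when the container is recycled

async function loadConfig() {
  if (!cachedConfig) {
    const s3 = new AWS.S3();
    const obj = await s3
      .getObject({ Bucket: 'my-config-bucket', Key: 'config.json' })
      .promise();
    cachedConfig = JSON.parse(obj.Body.toString());
  }
  return cachedConfig;
}

exports.handler = async (event) => {
  // The cold start pays for the S3 call; warm invocations reuse the cache.
  const config = await loadConfig();
  // ...use config, but never rely on the cache being there
  return config;
};
```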
Dead Letter Queue
Lambda can be invoked synchronously or asynchronously (not to be confused with Node.js asynchronous calls). To simplify, an example of a synchronous invocation would be a web request where you get results straight away. An asynchronous example would be a function triggered by an SNS notification.
An asynchronously invoked Lambda function will be retried twice. To catch errors, create a dead letter queue (DLQ) so you will know when your function fails. With a DLQ, you can send an unprocessed event to an SQS queue or an SNS topic, where you can take further action or at least save your invocation event information. You can monitor it via an SQS queue length metric and set an alarm. With SNS, you can send it to other Lambdas, even in another region.
If you are using Amazon SQS as an event source, configure a DLQ on the Amazon SQS queue itself and not the Lambda function. For more information, see Using AWS Lambda with Amazon SQS.
Asynchronously invoked Lambda - use dead letter queue
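A one-off sketch of attaching an SQS dead letter queue to a function through the AWS SDK; the function name and queue ARN are placeholders. In a real project this usually lives in your CloudFormation/SAM/Serverless Framework template instead, and the function's execution role needs permission to send messages to the target queue.

```javascript
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

async function attachDlq() {
  // Failed asynchronous invocations (after the retries) are sent to this queue.
  await lambda.updateFunctionConfiguration({
    FunctionName: 'process-order',
    DeadLetterConfig: {
      TargetArn: 'arn:aws:sqs:us-east-1:123456789012:process-order-dlq'
    }
  }).promise();
}

attachDlq().catch(console.error);
```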
Functions Calling Other Functions or Services
Calling functions from another function is valid and preferred in some patterns such as fan-out, but in other cases it is redundant and you are just paying double: once for the calling function that is waiting and again for the function that is executing. You will also make debugging more complex and lose the value of the isolation of your functions.
If you have an asynchronous process that calls several functions and services, try to use AWS Step Functions, or push data to DynamoDB, a queue (SQS), or a Kinesis Data Stream, or trigger an SNS notification. All of these can produce events that trigger other Lambdas to do additional asynchronous processing, as sketched below.
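A sketch of decoupling two functions with SQS instead of a direct Lambda-to-Lambda call: the first function just enqueues work and returns, and a second function is configured with the queue as its event source. The queue URL is a placeholder.

```javascript
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Producer: finishes quickly, no waiting (and paying) for the downstream function.
exports.producer = async (event) => {
  await sqs.sendMessage({
    QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/orders-queue',
    MessageBody: JSON.stringify({ orderId: event.orderId })
  }).promise();
  return { status: 'queued' };
};

// Consumer: triggered by SQS, processes records in isolation and can be retried,
// scaled, and monitored independently of the producer.
exports.consumer = async (event) => {
  for (const record of event.Records) {
    const order = JSON.parse(record.body);
    console.log('Processing order', order.orderId);
    // ...do the actual asynchronous work here
  }
};
```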
Web Sockets
Web sockets are used for real-time data connections with a server. For example, they are used to build chat applications and any other application that requires instant updates from the server. Until recently, not being able to use a web socket was one of the showstoppers for serverless systems. So, what is the problem? No permanent container running a function means no permanent connection. No permanent connection means no web socket. At re:Invent 2018, AWS introduced WebSocket APIs. WebSocket APIs are supported by API Gateway, which maintains the live web socket connections. Lambda is called when a message is received, and you can send a message from Lambda by calling a specially created endpoint. See more here and here.
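A sketch of pushing a message back to a connected client from Lambda through the management endpoint that API Gateway exposes for a WebSocket API. The domain, stage, and connection id are taken from the incoming event; the payload is illustrative.

```javascript
const AWS = require('aws-sdk');

exports.handler = async (event) => {
  const { domainName, stage, connectionId } = event.requestContext;

  // Client for the connection-management endpoint of this WebSocket API.
  const managementApi = new AWS.ApiGatewayManagementApi({
    endpoint: `${domainName}/${stage}`
  });

  // Send a message to the client identified by connectionId.
  await managementApi.postToConnection({
    ConnectionId: connectionId,
    Data: JSON.stringify({ message: 'hello from Lambda' })
  }).promise();

  return { statusCode: 200 };
};
```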
Before WebSocket APIs, AppSync was the best solution for real-time data connection. AppSync is a fully managed serverless GraphQL service for real-time data queries, synchronization, communications, and offline programming features. It is an AWS response to Google Firebase with a slightly different approach.
Database Choice
This is a big never-ending fight. Let's start with the facts. SQL databases are by design mostly vertically scalable (a bigger machine), while NoSQL databases (DynamoDB, DocumentDB, MongoDB) are primarily horizontally scalable (lots of small machines) and can basically scale almost infinitely at a low cost. SQL databases are not ultimately scalable by design. For horizontal scaling in SQL databases, you only have sharding and reading from replicas. With sharding, you split a database into smaller databases; with read replicas, you read data from a copy of the database, which always lags behind the main database. Both approaches demand adjusting the application and have, besides cost, a lot of other downsides. On the other hand, SQL databases have lots of features and a really powerful query language, SQL. Knowledge about SQL databases is also widespread. The connection to an SQL database in a serverless environment is an issue, as you will read in the section Communication with other systems.
If you want an ultimately scalable system and/or a low price, then go with NoSQL databases. But be prepared for a rocky road if you are used to SQL, as the concepts are quite different. Think about how scalable you need to be! An SQL database with its power is a commodity that is hard to give up.
By definition, if you are building a serverless system on AWS, then DynamoDB should be your first choice, mainly because of the hassle you have with connecting to an SQL database. Are you confused? Well, you can combine both solutions: DynamoDB for fast, scalable data access and an SQL database where you need a powerful query language. You can set up a DynamoDB stream to transfer data to an SQL database on every insert, as sketched below. When starting with DynamoDB, learn the best practices.
If you are building a serverless system on AWS then DynamoDB should be your first choice
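A sketch of a Lambda attached to a DynamoDB stream that copies new items into an SQL database for rich querying. The mysql2 client, the connection settings taken from environment variables, and the table and column names are illustrative placeholders.

```javascript
const AWS = require('aws-sdk');
const mysql = require('mysql2/promise');

let pool; // reuse the connection pool across warm invocations of the same container

exports.handler = async (event) => {
  if (!pool) {
    pool = mysql.createPool({
      host: process.env.DB_HOST,
      user: process.env.DB_USER,
      password: process.env.DB_PASSWORD,
      database: process.env.DB_NAME,
      connectionLimit: 1 // one request per container, one connection is enough
    });
  }

  for (const record of event.Records) {
    if (record.eventName !== 'INSERT') continue;
    // Convert the DynamoDB attribute-value format into a plain object.
    const item = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
    await pool.execute(
      'INSERT INTO orders (order_id, payload) VALUES (?, ?)',
      [item.orderId, JSON.stringify(item)]
    );
  }
};
```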
If you decide to go with SQL and you do not want to manage it yourself, then you have several options in AWS. You can choose between Amazon Aurora and Amazon Relational Database Service (RDS).
- Amazon Aurora is a MySQL and PostgreSQL-compatible proprietary relational database built for the cloud. It is up to five times faster than standard MySQL databases and three times faster than standard PostgreSQL databases. It should be the number one choice for SQL on AWS. It also offers Aurora Serverless, which is an on-demand, auto-scaling configuration of Aurora. The serverless version is in its early stages and consequently has a few shortcomings.
- Amazon RDS supports MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server databases, which are managed by AWS.
Learn When Not to Use Serverless
You can build almost every system serverless. And as the technology evolves, it is slowly taking care of some corner cases. But like any solution, serverless is not suitable for everything. Applications that demand ultra-low latency and applications that do not fit within Lambda's hardware limitations can be a problem for serverless. Lambda has a 15-minute timeout, so no long processing that cannot be split into smaller chunks. There is also a 3 GB memory limit and 512 MB of disk space. See more about when not to use serverless here.