Notes with race conditions

Last year’s books and harmful perfectionism

Fri, 22 Mar 2019 22:21:14 +0000

Good friends, good books, and a sleepy conscience: this is the ideal life. ― Mark Twain

The power of the book

Quite a while ago, when I was starting my professional career, I decided to dig deeper into the LabVIEW programming language to improve the project I was working on back then. As part of that process, I read four books. After that I became the go-to person on the subject inside the company, answered many questions on programming forums and gained some reputation there, and participated in a couple of online competitions, winning one of them. That was the first time I was really struck by the power of a good book.

A couple of years ago I adopted a more consistent and systematic approach to my reading and I’m still happy about it:

1. Be consistent and read or listen to 22-25 books a year.
2. Make notes (3-10 sentences) on every book while reading and revise the notes once in a while.
3. Drop the book if I don’t like it.

Number 3 is not easy to do. If you like to finish everything you start, this probably applies to books as well. However, there are so many good books. Is it really worth pushing through to the end of a book we don’t like at all and don’t find useful, just to say “I did it! I read it!”? Definitely not. Being able to finish what we start is an amazing skill, but in this case it’s an example of harmful perfectionism. So I’m quite happy to follow rule #3.

The best from the last year

Back to last year’s books… I won’t bother you with a long review of all 25 books, dear reader, but I want to say a few words about the books I liked the most from different categories:

Authenticity (by David Posen) I would say this was the best book of 2018 for me. It is fun to read, and at the same time it examines the nature of stress from many different angles.

How to Talk So Kids Will Listen & Listen So Kids Will Talk A big part of parenting is working on yourself. This book has a lot of great advice and real-world examples. I listened to this book and it’s so good that I plan to read it this year. And I can’t say I often go through the same book more than once.

The War of Art When I have downtime or when I feel lazy, I go and re-read the first chapter. It’s inspirational and simply amazing. I also really liked this quote: Someone once asked Somerset Maugham if he wrote on a schedule or only when struck by inspiration. “I write only when inspiration strikes,” he replied. “Fortunately it strikes every morning at nine o’clock sharp.”

Born to Run Some history, some running theory, facts about the human body and biographies of highly respected people, all unified around one common topic: running. I’m not even close to a “serious runner”, but I do like to run 10-12 km once or twice a week. This book is fun to read, contains a lot of useful information and definitely improved the quality of my runs. Also, this book was recommended by a friend of mine who was a running coach and who participated in 100 km races, which speaks for itself.

The Go Programming Language The best book about a programming language for experienced programmers. A very short and detailed (yes, both at the same time) explanation of important Golang idioms. It was my second Golang book, read after working with Golang for a bit and going through multiple tutorials, but I would recommend it as a first Golang book for experienced software developers.

Hands-On Machine Learning with Scikit-Learn and TensorFlow There has been hype around machine learning for quite a while already. If you google something like “top N in-demand tech skills 2019”, machine learning will be in each of the lists, very likely at the top. I see that some machine learning knowledge can be a great complementary skill to software engineering skills, especially if you work with large distributed systems and big data (these days we see ML everywhere we see big data, don’t we?). This book is “developer friendly”: it’s very pragmatic, it has a good mix of theory and code (although it has quite a bit of math as well) and it covers a lot of ML topics.

Fiction category and movies

I found that I read more professional, business and self-improvement books lately and watch movies for my “fiction” content. I did enjoy fiction books last year as well; I just didn’t include them in my short list. I also noticed that when that rare “movie evening” comes (you know, two little kids…), I often have a hard time selecting a good movie to watch. I have never had such issues with books. My “want to read” list grows way faster than I can “clean” it.

Redirects with AWS Route53, S3 and CloudFront

Thu, 17 Jan 2019 21:28:17 +0000

Recently I had to change domains and subdomains for a project I worked on a couple of years ago. Usually redirects are simple, but sometimes they are not as simple as many of us would like them to be. Changing DNS records may not be enough. We may need to create an S3 bucket and a CloudFront distribution.

So this will be a short blog post, which describes a couple of scenarios with domain redirects on AWS.

What do we have

Imagine we have a project hosted on AWS: static content served through S3 + CloudFront, an API implemented with AWS API Gateway + Lambda, and maybe some microservices running on ELB/ALB plus EC2 or ECS. Basically, something similar to what I have described here. In addition to that, we have a relatively complex Route 53 hosted zone with 20+ records (A records, subdomains, email records, aliases, etc.).

Now, for whatever reason, we want to host part of the platform outside of AWS.

Redirect subdomain to an external hostname

Let’s start with something very simple first. Our domain is example.com and we want to have blog.example.com let’s say on Ghost. For this, we just need to create a CNAME blog.example.com with Alias ‘No’, which points to something like example.ghost.org.
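As a rough sketch of the same record in code, assuming we manage Route 53 from Python: the function below only builds the change batch for a plain (non-alias) CNAME. The helper name and the TTL of 300 seconds are my own assumptions, not something the console requires.

```python
import json


def cname_change_batch(name, target, ttl=300):
    """Build a Route 53 change batch that upserts a plain (non-alias) CNAME."""
    return {
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "CNAME",
                    "TTL": ttl,  # assumed value, pick whatever fits your zone
                    "ResourceRecords": [{"Value": target}],
                },
            }
        ]
    }


batch = cname_change_batch("blog.example.com", "example.ghost.org")
print(json.dumps(batch, indent=2))
```

With boto3 installed and credentials configured, a dict like this could be passed as the ChangeBatch argument of the Route 53 client's change_resource_record_sets call, together with the hosted zone ID.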

Redirect root to www

If www.example.com is an alias to an internal AWS resource (not a CNAME like in the example above), then it’s quite easy. We can just create an alias for root A record which points to www.example.com. If www.* is a CNAME to an external resource, then it’s a bit more complicated.

Redirect root to an external hostname

What if we want our main domain example.com to be outside of AWS and keep AWS Route 53 as our main DNS service? Also, we want customers to always see our main domain in their browser.

Our www.example.com is a CNAME which points to a resource outside of our AWS environment. Here is what we need to do to redirect the root (as of today).

We need to create an S3 bucket whose name matches our main domain name (example.com), then go to the bucket properties and enable Static Website Hosting with the redirect option. We will be redirecting to www.example.com. After that, we can create an A record set in Route 53 for our root domain with an alias to our example.com S3 bucket. Still simple enough.
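The redirect option from the bucket properties can also be expressed as a small configuration document. Here is a minimal sketch that only builds that document; the helper name and the choice of "https" as the default protocol are assumptions for illustration.

```python
def redirect_website_config(target_host, protocol="https"):
    """Website configuration that redirects every request to target_host."""
    if protocol not in ("http", "https"):
        raise ValueError("protocol must be 'http' or 'https'")
    return {
        "RedirectAllRequestsTo": {
            "HostName": target_host,
            "Protocol": protocol,
        }
    }


config = redirect_website_config("www.example.com")
print(config)
```

With boto3, a dict of this shape is what the S3 client's put_bucket_website call expects as its WebsiteConfiguration argument for the example.com bucket.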

At this point, everything should be working except HTTPS (https://example.com). To make the HTTPS redirect work for the root domain, we need to create a CloudFront distribution for our example.com bucket. After that, we point our A record in Route 53 to the CloudFront distribution instead of the bucket, and HTTPS will work. A couple of possible pitfalls in this step:

when we set up our CloudFront distribution, in the Origin Domain Name field we need to specify the Static Website Hosting endpoint (see the screenshot above) and not the bucket name: example.com.s3-website-us-east-1.amazonaws.com, not example.com.s3.amazonaws.com; this is very important
if we see an error like “This XML file does not appear to have any…” when trying to access the CloudFront distribution link, we probably set the Origin Domain Name incorrectly; we have to fix that as described above and create an invalidation to update the CloudFront cache
we don’t need to make our bucket public or create any bucket policies, because our distribution origin is not the bucket itself
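The first pitfall is easy to check mechanically. A small sketch, with helper names that are purely my own: it builds both endpoint forms for a bucket and tells them apart. The dash form (s3-website-us-east-1) matches the example above; some regions use a dot after "s3-website" instead, so the check accepts both.

```python
def s3_rest_endpoint(bucket):
    """The bucket (REST) endpoint, which is NOT what we want as the origin here."""
    return f"{bucket}.s3.amazonaws.com"


def s3_website_endpoint(bucket, region="us-east-1"):
    """The Static Website Hosting endpoint, in the dash form used above."""
    return f"{bucket}.s3-website-{region}.amazonaws.com"


def is_website_endpoint(origin):
    """True if the origin looks like a Static Website Hosting endpoint."""
    return ".s3-website-" in origin or ".s3-website." in origin


for origin in (s3_rest_endpoint("example.com"), s3_website_endpoint("example.com")):
    verdict = "ok" if is_website_endpoint(origin) else "wrong origin for a redirect bucket"
    print(origin, "->", verdict)
```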

So here is what we ended up with: the root domain redirects to www.example.com using CloudFront + S3, and www.example.com is a CNAME record in Route 53 which points to whatever we need.

This is the simplest way I have found so far to redirect the root domain if we don’t want to move all our DNS records from Route 53 to another DNS service.

Dealing with taken bucket names

If you can’t create a bucket with a name of your root domain, which shouldn’t happen often, you can create a bucket with a name like redirect.example.com. Do the steps from above for that bucket and then create an alias in Route53, which points your root domain to redirect.example.com record. We can create an alias from root to redirect.example.com record because the latter is an alias to an AWS resource (CloudFront distribution).

Conclusion

To paraphrase the great Warren Buffett: “Redirects are simple, but not easy”. Hopefully, the AWS team will eventually give us a way to achieve the above with one little Route 53 record and without an S3 bucket and a CloudFront distribution.

Five undervalued git commands

Tue, 19 Jun 2018 00:43:39 +0000

Today, git is the standard version control system in software development and for many other uses.

Commands like git commit, checkout, pull, push, status are executed multiple times a day. However, git has a lot of more advanced features, which are not frequently used. Today I want to talk about five commands, which usually are not needed that often, but are quite useful at the right moment.

1. git bisect

Imagine a situation where something was recently working, but now it does not. Or, perhaps, all the tests were passing, but now one little test fails. It is not trivial to understand what happened. To simplify the investigation process we may want to find out the exact commit, which introduced the bug. git bisect is our friend.

It uses binary search between bad (current commit which is not working) and good (a commit from the past which was working) commits to find which commit introduced the bug.

1. git bisect start # let's start our workflow
2. git bisect bad   # current commit is broken, mark it as bad
3. git bisect good <commit> # let bisect know which commit was working (<commit> is a placeholder)
# After step #3 bisect checks out a commit between good and bad
4. # we check if current commit has the issue or not
5a. git bisect good # if test at step #4 succeeded, we mark current commit as good
5b. git bisect bad  # if test at step #4 failed, we mark current commit as bad
# repeat #4 and #5 until we find the commit which introduced the bug
6. git bisect reset # when done, return to the commit we started from

Instead of bad and good we can also use new and old.
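To see why this is fast, here is a toy model of the same binary search in Python. It is only an illustration of the idea, not how git implements it: the history is a plain list ordered from oldest to newest, and the is_bad check is a hypothetical stand-in for running our test at step #4.

```python
def first_bad_commit(commits, is_bad):
    """Find the first bad commit in a linear history.

    commits is ordered oldest to newest; commits[0] is known to be good
    and commits[-1] is known to be bad.
    """
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1
        if is_bad(commits[mid]):
            bad = mid   # the bug was introduced at mid or earlier
        else:
            good = mid  # the bug was introduced after mid
    return commits[bad], steps


# Hypothetical history of 100 commits where the bug appeared in commit c37
commits = [f"c{i}" for i in range(100)]
culprit, steps = first_bad_commit(commits, lambda c: int(c[1:]) >= 37)
print(culprit, "found after", steps, "checks")
```

With 100 commits between good and bad we need at most seven checks instead of testing every commit one by one.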

2. git worktree

Let’s talk about a different scenario. We have a large repository, which many people are working on simultaneously. It takes quite a bit of time to clone the repo and also some time to fetch all the changes regularly. We use git-flow and have a branch per feature.

Now imagine that, for some reason, we need to work on a different branch and keep our current one open at the same time. There are a couple of ways to achieve this. The most obvious one is to clone the same repo once again and check out a different branch. However, cloning takes time, as our repo is quite big, and it creates a bit of a mess, as now we would have something like my-project and my-project2 in our projects directory. Also, we would have to keep both folders in sync with the origin, which means extra pull operations. This is where git worktree helps. Instead of cloning the same repo one more time we can do this:

1. git worktree add ../worktrees/myproject mybranch # create new working tree
2. git worktree list # list all working trees connected to our repo
... # do some work, then clean up
3. rm -r ../worktrees/myproject
4. git worktree prune

After step #1, we have two folders in our projects directory: myproject and worktrees/myproject. They normally have different branches checked out (by default git refuses to check out the same branch in two working trees), but they are connected to the same repository and we didn’t have to run git clone a second time.

3. git stash push -m “message”

The stash command is much more popular than the two mentioned above. It is handy when there is a need to save unfinished changes somewhere without committing them. So we can do:

git stash # save local changes to stash entry
... # do something here, e.g. jump to a different branch
git stash list # print all available stash entries
git stash apply stash@{0} # apply changes from entry '0' without removing it from the list
git stash pop # apply changes from the last added entry and remove it from the list

In my opinion, the command options are very important when creating a stash entry. By default git stash associates the message from the latest commit with the newly created stash entry. This may be good enough if we need our stash entry for a couple of minutes, but what if we need to create two stash entries one after another and keep them for a couple of days or a week? The output of git stash list will be very confusing.

So, it’s always better to use:

git stash push -m "changes description"
git stash save "changes description" # older deprecated alternative

stash push has a couple of other useful options:

-k, --keep-index will not stash changes which have been already added to the index
-u, --include-untracked will add untracked files to the stash

It’s a good idea to create two or three git aliases for different variations of git stash push.

4. git var -l

This command is different from the others, because it is educational. It prints out git logical and configuration variables. From its output we can find out:

preconfigured aliases and aliases configured by us
that our default editor is vim
what’s the current default push strategy
and much much more…

It’s a good educational tool. If needed, a setting from the git var -l output can be researched further and changed.

As per the git var docs, listing configuration variables this way is deprecated in favor of git config -l. Also, you may notice an alias in the git var -l or git config -l output which you have never configured, e.g.:

alias.ac=!git add . && git commit

Check the output of git config -l --show-origin to find out where it comes from.

5. git commit [-v, --amend --no-edit]

This command could be seen as an exception in this list, as it is needed quite often and is more widely known and used. But it’s useful enough that I want to include it.

Sometimes, we need to review our changes before committing to come up with a better commit description. git commit -v will open a text editor and will display all the changes in addition to filenames to be committed. I have an alias git ci for this command.

git commit --amend is something I was really missing back in svn times. It is useful when we make a commit and realize that we forgot to mention something in the commit message or that it has a typo. --no-edit option can be added when we are happy with our commit message, but want to include a forgotten file into the last commit. I have aliases git ca and git can for these two commands.

Afterword

git is really powerful. It is worthwhile to take a quick look at the output of git help -a to get a general idea of what it can do beyond the commands we all use every day. It is also worth periodically checking out other people’s .gitconfig files on GitHub for new ideas on what else in our git workflow can be optimized and improved.

An ode to password managers

Thu, 31 May 2018 08:34:01 +0000

Do you remember all your passwords? If so, that's not good!

This post will not be as technical as my usual posts are and so it’s for everyone, as everyone uses a computer, phone or tablet and sooner or later needs to deal with passwords.

This post is about the importance of password managers. There are so many of them these days, both free and paid. They are easy to install and easy to use, but many people still either don’t know about this approach or don’t bother trying it. If this post convinces at least one person to move from memorizing 1-5 passwords to using a password manager, I will be more than happy.

Common approaches

Let’s list common approaches to password management:

1. One memorized password for everything.
2. Two passwords: the first for important stuff, the second for everything else.
3. Multiple (probably five to ten) passwords, all memorized. This approach makes it easy to forget which password is for which website.
4. A password system: for example, for gmail.com I may have the password liamgrotciv (reversed ‘gmail’ plus reversed first name).
5. A paper or electronic list of all passwords.
6. Saving all passwords in the browser, e.g. Google passwords.
7. Online cloud-based password managers, which securely generate and save passwords somewhere in the cloud and allow access to them from all devices.
8. Offline password managers that securely save passwords somewhere on a certain device. It is the user’s job to synchronize passwords between all devices.
9. Hardware password managers.

If you use methods 7-9, then you have probably already thought about the importance of this topic. If not, please bear with me for three more minutes.

Reasons

How we manage our passwords becomes more and more important, because we all use tons of services (which almost always require you to have login/password) and their count only grows every day.

Let’s be honest, no matter how good the memory is, sometimes we may forget a password for a particular website or even a home WiFi router.

Also, using the same password for multiple services (even just two) is extremely risky. If one little service that you signed up for long ago, used once and forgot about leaks your password, then all your other accounts can be compromised.

Many services allow us to connect using Facebook or Gmail account, but not all. Plus, those services may ask us to allow them to access our Gmail/Facebook contact list and we don’t want that.

Some people may ignore the importance of the problem and think that there is no need to hack them, or they have nothing to hide, or who needs to read their emails. If this is your approach, please, reconsider it. There are so many ways to hurt you by knowing your email password: starting from erasing your cloud drive to getting access to your profile on different social networks.

What a password manager can do for you

generate strong, complex and unique passwords
store passwords, logins and other data (e.g. credit card numbers) securely
synchronize passwords between multiple platforms and devices
backup and restore data
remind you to change passwords once in a while (which is a very good practice) and simplify the changing process

Managed cloud based password managers vs local password managers

Local password managers are usually free to use. They are apps which you run on your device (often many platforms are supported). After you provide your master password, the app opens a file, which is stored locally and contains all your passwords in encrypted form. Then you can look up your username/password for a particular website or create a new one.

KeePassX password manager

You have to take care of synchronization between multiple platforms and devices, backups and client updates yourself. For synchronization, you can use something like Dropbox. Doing your own synchronization allows you to add another layer of encryption. You can use a tool like Cryptomator to encrypt your local files before uploading them to a cloud drive. Encrypting some files in your cloud drive may be a good thing to do even when they have nothing to do with passwords.

The image is from Cryptsync website

Cloud password managers are not free, but they solve all the problems mentioned above (synchronization and backups) and add an extra convenience, as you can integrate them with your browser and won’t need to jump to a different application to copy your username/password.

1Password, in the picture below, displays a little icon in your login form and can either generate a new password for you (when you are creating a new account) or insert the proper password (when you want to log in using an existing account). It makes life much simpler.

Unfortunately, the browser integration adds another layer of software which an attacker can use. E.g.: read more details on how LastPass (probably the most popular password manager) leaked passwords through its Chrome extension. It is still not a reason to avoid browser integration. Such issues are fixed very quickly (often before they become known publicly), and the good thing is that the browser updates your extensions automatically, so you won’t have to worry about this.

Which password manager is better

There is no right answer. I prefer 1Password or KeePassX + Cryptomator + a cloud drive. But be aware that most password managers have been found to have at least minor security vulnerabilities one or more times over the years on at least one platform they support, and there will be more in the future. So it’s important to always keep your client up to date.

How long should my passwords be?

There is no need to argue on this topic. The longer, the better, with as many different symbols as possible. You won’t have to memorize them anyway. In my opinion, 16-32 characters is a good range.
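To get a feel for what a password manager does when it generates a password, here is a minimal sketch using Python's standard secrets module. The alphabet (letters, digits and ASCII punctuation) and the default length of 24 are my own assumptions; real password managers let you tune both.

```python
import math
import secrets
import string

# Assumed alphabet: ASCII letters, digits and punctuation
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_password(length=24):
    """Generate a random password using a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def entropy_bits(length):
    """Bits of entropy for a uniformly random password of the given length."""
    return length * math.log2(len(ALPHABET))


print(generate_password())
print(f"16 characters: about {entropy_bits(16):.0f} bits of entropy")
print(f"32 characters: about {entropy_bits(32):.0f} bits of entropy")
```

Every extra character adds the same number of bits, which is why length matters so much more than clever substitutions.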

What about my master password?

Take this very seriously. It should be long (a sentence with at least six words and numbers) and hard to guess. The most important thing to keep in mind is that you should never forget your master password.

Good password managers should not allow you to reset your master password. If they do, this is a potential hole in the security system. Simply put: if you have an option to reset your master password, then your service provider has this option as well.

Conclusion

Any password manager is better and much more secure than none. Download a password manager and start using it. It will only take fifteen minutes to set up. Then you will love it.

Serverless with AWS Lambda and API Gateway: not a beginner tutorial

Wed, 16 May 2018 17:38:36 +0000

I have been working intensively on building serverless applications with AWS over the past few years. Some projects used a serverless architecture with lambda at its core; some used lambda functions only for small parts of the system.

Below I want to share a couple of lessons learned and describe bottlenecks which you may face while developing even a simple system. This is not a tutorial for absolute beginners. I will not be talking about why the serverless approach is good or bad, and I expect that you are familiar with AWS, have learned about lambda and have played with it a bit.

Overview

A simplified architecture of serverless application may look like this:

I intentionally omitted Route53, VPC and a couple of other services (which you most likely will need) to keep focus on the core serverless components.

Usually we use S3+CloudFront for all the static content and all the requests go through API Gateway to lambda, which is used as a backend-core. CloudWatch also plays an important role in aggregating logs and in some cases triggers lambda functions at a specific day/time. From lambda we can access a database to save/load the data, SES/SNS to send notifications and multiple 3rd party services.

You can read more details on serverless webapp creation here. Now let’s focus on some issues, which you may face while designing and developing a serverless application.

VPC or no-VPC?

VPC is one of the core services and is used in almost every project, however there are a couple of extra things we need to consider with serverless architecture.

Usually, a database is placed into a private subnet, and an instance that requires both internet access and database access can be placed into a public subnet. You cannot do this with lambda. Each lambda is assigned a private IP address but no public IP address, so you have to place your lambda function into a private subnet. If the lambda requires internet access, you will have to add a NAT gateway/instance, which costs money, and you will end up paying for what you could get for free in a non-serverless approach.

Another important thing to watch for when using VPC+Lambda combination is the number of available IP addresses in your subnet. And it may not be easy, as you don’t control the number of lambda functions running at the moment. So one of the Lambda Best Practices is: Don’t put your lambda function in a VPC unless you really have to.
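A quick way to see how tight a subnet is: AWS reserves five addresses in every subnet (the first four and the last one), so the number of addresses left for lambdas and everything else is smaller than the CIDR size suggests. A minimal sketch with the standard ipaddress module; the CIDR blocks below are made-up examples.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def usable_ips(cidr):
    """Number of addresses that can actually be assigned in an AWS subnet."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


for cidr in ("10.0.1.0/24", "10.0.2.0/26", "10.0.3.0/28"):
    print(cidr, "->", usable_ips(cidr), "usable addresses")
```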

RDS and SQL databases in serverless world

SQL databases are not easy to scale. It becomes even more complex in a serverless app, where you don’t control how many lambdas you have at any given moment, so your database may quickly become a bottleneck. However, even in a simpler scenario, where we don’t expect thousands of lambdas running at the same time, we may face potential problems.

It’s considered good practice to place heavy initialization code (e.g. connecting to the database, loading and applying configuration) outside of the lambda handler, because you only want to execute it once instead of loading exactly the same configuration objects on every request.

# This will be run only once for each lambda (when it's created)
config = load_configuration()

# This will be run on every call to lambda function
def handler_name(event, context):
    # use configuration object
    ...
    return some_value

Creating a database connection only once in the beginning and reusing it for multiple requests makes sense as it affects performance (in a good way). However with the lambda we will face the following problem: we don't know when to close the connection and the lambda function may be *killed* at any moment. As a result, we may end up with many stale connections.

So, creating a database connection every time the request is received and closing it after the request is handled is not such a bad idea when using RDS + Lambda.
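A minimal sketch of the connection-per-request pattern. To keep it runnable anywhere I use sqlite3 from the standard library as a stand-in for a real RDS driver, and DB_PATH is a hypothetical placeholder for the real connection settings; the point is only the open, use, close shape inside the handler.

```python
import sqlite3
from contextlib import closing

DB_PATH = ":memory:"  # hypothetical stand-in for the RDS connection settings


def handler(event, context):
    # Open the connection inside the handler so it never outlives the request
    with closing(sqlite3.connect(DB_PATH)) as conn:
        row = conn.execute("SELECT 1 + 1").fetchone()
    # At this point the connection is closed, even if the query had raised
    return {"statusCode": 200, "body": str(row[0])}


print(handler({}, None))
```

We pay the connection cost on every request, but no connection is left dangling when the lambda instance goes away.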

Warming up your lambdas

If your lambda is not used, it will eventually be killed. The good thing is that it all happens automatically, AWS scales it up and down depending on the load, and we don’t have to worry about this.

The bad thing is that, eventually, we may end up in a situation where no lambdas are running at all. One will be created as soon as a customer triggers the corresponding functionality. The problem is that lambda initialization may take a couple of seconds, plus some time is required to handle the request. 2018 is not the year when a customer should be waiting 5-6 seconds (network delay not even considered) for a response from the server. It may hurt your business. The solution is simple: we can create a CloudWatch rule which triggers the lambda every X minutes to keep at least one instance of the lambda alive. (There is no official information on how often exactly the lambda should be triggered to stay alive. You may do your own experiments or research, but ~15 min should solve the problem.) In the handler we can catch this keep-alive CloudWatch event and return from the function immediately.

However we should remember that each call to lambda does affect our budget.
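Catching the keep-alive event in the handler can look roughly like this. Scheduled CloudWatch events arrive with "source" set to "aws.events" and "detail-type" set to "Scheduled Event"; the return values and the is_keep_alive helper are my own illustration.

```python
def is_keep_alive(event):
    """True for the scheduled CloudWatch event we use only to keep the lambda warm."""
    return (
        event.get("source") == "aws.events"
        and event.get("detail-type") == "Scheduled Event"
    )


def lambda_handler(event, context):
    if is_keep_alive(event):
        # Nothing to do: return right away to keep the invocation short and cheap
        return {"warmed": True}
    # Regular request handling goes here
    return {"statusCode": 200, "body": "handled a real request"}


print(lambda_handler({"source": "aws.events", "detail-type": "Scheduled Event"}, None))
print(lambda_handler({"httpMethod": "GET", "path": "/v1/products"}, None))
```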

Deployment

Obviously we want an easy automated deployment. Fortunately we have multiple options here:

CloudFormation/Terraform
Serverless Framework - a wrapper on CloudFormation, which gives a very nice experience, has tons of plugins and works across different clouds
rich API which AWS provides
apex

Usually we want to have multiple stages (dev, qa, prod), so we need to run different lambda functions in each stage, but we can’t have two lambda functions with the same name in the same region. I have seen very different approaches to solving this problem and supporting multiple deployment stages:

different suffixes for API and lambda in the same region
using different regions
using different accounts for dev and prod
lambda versions and aliases & API Gateway stage variables

The last two options are the most convenient in my opinion and probably the most popular.

Lambda configuration

It’s always convenient when your configuration is decoupled from your code. It simplifies many things, and we won’t have to redeploy the whole package when all we want to do is change the log level from Info to Warning.

This is a topic which is discussed often enough during breaks at conferences and meetups. The article Configure your lambda functions like a champ and let your code sail smoothly to Production describes the topic really well. Please read it; it is worth it.

I will just mention once again, that in your lambda handler you receive two objects: event and context; and you can check what version of lambda was called (to load the corresponding configuration from S3/DynamoDB/etc) using code like this:

alias = context.invoked_function_arn.split(':')[-1]
if alias == 'prod':
    ...
else:
    ...
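One caveat with taking the last element: when the function is invoked without an alias or version, the ARN has no qualifier and the last element is the function name. A slightly more defensive sketch (the helper name and the sample ARNs are made up for illustration):

```python
def invoked_qualifier(invoked_function_arn):
    """Return the alias or version the lambda was invoked with, or None.

    ARN layout: arn:aws:lambda:<region>:<account>:function:<name>[:<qualifier>]
    """
    parts = invoked_function_arn.split(":")
    return parts[7] if len(parts) == 8 else None


qualified = "arn:aws:lambda:us-east-1:123456789012:function:my-func:prod"
unqualified = "arn:aws:lambda:us-east-1:123456789012:function:my-func"
print(invoked_qualifier(qualified))
print(invoked_qualifier(unqualified))
```

Note that the qualifier can also be a version number rather than an alias, so the configuration lookup should have a sensible default.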

Other

The article is already quite long and I haven’t mentioned even half of what I wanted to say. I will be writing separate blog posts on other serverless topics. For now I just want to mention a couple of important features, which may be useful:

API Gateway caching - a great and easy way to offload the lambda function, but remember that, unless the query parameters are configured as part of the cache key, requests that differ only in their parameters share one cache entry. For example, the following two requests may return the same result when caching is enabled:

GET /v1/products?p1=1&p2=2

GET /v1/products?p1=29&p2=28

CORS - another piece of web logic which we can keep on the API Gateway side, and it’s quite easy. Let’s never set Access-Control-Allow-Origin to ‘*’ in production unless it’s actually needed
lambda policies - easy to add, and they have to be set at a granular level for each alias of your lambda function. If the lambda only needs to be called by API Gateway, and only by a POST request with a unique path, then write a corresponding lambda policy and allow only what is necessary
Serverless application security is not in a bad state, but it continues to be a challenge and the developer still has to care about many things; the most popular attack, SQL injection, is still here.
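To make the lambda policy point concrete, here is a sketch of what such a narrow statement can look like: it lets API Gateway invoke one alias, and only for a POST on one path. The helper name, account ID, API ID and path are all made up for illustration.

```python
import json


def invoke_policy_statement(function_arn, source_arn):
    """Resource policy statement: API Gateway may invoke this one alias only."""
    return {
        "Effect": "Allow",
        "Principal": {"Service": "apigateway.amazonaws.com"},
        "Action": "lambda:InvokeFunction",
        "Resource": function_arn,
        "Condition": {"ArnLike": {"AWS:SourceArn": source_arn}},
    }


statement = invoke_policy_statement(
    # hypothetical function alias
    "arn:aws:lambda:us-east-1:123456789012:function:orders:prod",
    # hypothetical API: any stage, POST method, /orders path only
    "arn:aws:execute-api:us-east-1:123456789012:abc123/*/POST/orders",
)
print(json.dumps(statement, indent=2))
```

The source ARN is what does the narrowing here: a GET request or a different path on the same API will not match it.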

A safe approach to project setup

Sun, 08 Jan 2017 12:24:00 +0000

A ship will sail the way you build it.

TLDR

Proper project setup may require more effort in the beginning, but it will save you months in the long term by reducing the likelihood of a mistake and by simplifying the learning curve for new developers.

Here are some techniques (in random order) that you should consider using in your development process and your project setup. The bolded items will be the focus of this blogpost.

source code control
clear project structure
maximum warning level
unit tests
style consistency
continuous integration
static code analysis
dynamic code analysis
code reviews
benchmarking
integration tests

See the sample project on GitHub. It sets the warning level to maximum and treats all warnings as errors; it also uses code style analysis as well as unit testing and mocking frameworks. The project was written in C++ (and this blog post is also a bit C++-centric), but everything discussed here is really language agnostic. I used googletest and Google CppLint in my example, but other libraries could easily be used as well. If you’re from the C++ world, take a look at the great talk by Marshall Clow about project setup from CppCon 2014.

Background

I have seen many different project setups and the varying outcomes that resulted. This topic is extremely important, because proper setup will save development time, reduce the number of errors and inconsistencies, simplify the learning curve for new developers and make everybody’s life easier in general.

There are not so many books that emphasize the importance of this topic. As a new developer, you can easily learn about this by looking at some high-quality projects on GitHub, but you need to know which projects to look at. Plus, if you’re just starting your career, usually you do not focus your attention on project setup approaches. However proper project setup has a huge impact on product quality - just as a foundation impacts the quality of the house it supports.

Let’s go through items from the list above step by step.

Source code control

Not much to comment on here. Fortunately, nowadays everyone is using it. The important thing is to be consistent with your workflow (are you using the gitflow workflow or something else? Do you have a separate branch for each feature? What versioning approach are you using? Do you create a tag for every release, keep a release branch, or both?).

Clear project structure

By clear I mean that it should be not only intuitive and logical, but that it should also align with widely known practices. For example, a common practice is to have folders like src, include, build and test, and to have a Readme file in the root folder. So it would be unexpected to see the main Readme five directories down or half of the source files in the build folder.

Warnings

Every warning should be treated seriously and never ignored. Compiler developers put in tons of effort and do an amazing job of preventing errors by directing us to potentially dangerous pieces of code.

It’s always a good idea to enable the maximum warning level (-Wall and -Wextra on GCC and Clang, /W4 on MSVC) and to enable the setting “treat warnings as errors” (-Werror on GCC and Clang, /WX on MSVC) in new projects.

I witnessed a situation where a colleague of mine spent hours trying to figure out the reason behind a random crash in a huge project. He didn’t pay attention to warnings, because the project had thousands of them. The answer was simple:

void foo() {
    std::string s("test");
    printf("%s", s);  // Oops! char* (s.c_str()) is expected.
}

This situation occurred long ago. Actually, these days many (but not all) modern compilers would generate an error and not a warning on the printf line. Still, this example illustrates the situation well. If the project had been better set up, it would have taken two seconds to fix, or it would likely not have happened at all.
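For comparison, here is a minimal sketch of the corrected call. The helper name format_value and the buffer size are my own assumptions for illustration; the only point is that a "%s" conversion needs a const char*, which std::string provides through c_str().

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the original project): formats a
// std::string through a printf-style function the correct way.
std::string format_value(const std::string& s) {
    char buf[64];
    // "%s" expects a const char*, so pass s.c_str(), never the std::string itself.
    // snprintf never writes more than sizeof(buf) bytes, including the terminator.
    std::snprintf(buf, sizeof(buf), "value: %s", s.c_str());
    return std::string(buf);
}
```

With warnings treated as errors, the original version would fail the build instead of crashing at run time.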

It’s harder with legacy code that has hundreds of warnings, but I have no doubt that it’s worthwhile to invest the time and clean up all those warnings slowly, step by step.

Unit tests

This item is the most important one. If you want to integrate only one item from the list above, integrate this one. Unit tests have been popular for a long time. There are hundreds of books about this topic. All the languages I have worked with (procedural, OO, functional and even very specialized ones like LabVIEW) have free libraries for unit testing. Many languages even come with unit test libraries out of the box; but surprisingly, there are tons of serious projects out there (even new ones, not only legacy projects), which don’t use tests.

Not only do unit tests help you to verify your code through different execution paths and edge cases, prevent regression errors, and make new developers more confident in making changes, but they also force you to design your classes/interfaces/functions better and decouple modules from each other.

Another approach I see very often is writing unit tests, but not using mock objects. Many developers don’t know about them. It’s definitely better to have tests without mocks than to have no tests at all. However, when you don’t use mocks, you are most likely using the unit test library to create integration tests. As a result, you don’t test the failure path, you don’t simulate exceptions which may be thrown in the dependencies of the object under test, and you don’t decouple modules from each other well enough. On top of that, the execution time of your tests is likely to be very long. Mock frameworks are very powerful (and very often free) tools, so there is no reason not to use them. This example of a unit test illustrates how valuable mocks can be (MockObject is injected into the object under test as its dependency).
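To make the idea concrete without pulling in a framework, here is a minimal hand-written mock. This is only a sketch: Storage, Greeter and MockStorage are hypothetical names invented for this illustration, and a real project would more likely use a mocking framework such as the one that ships with googletest. The mock lets the test cover both the success path and the failure path without a real database or network.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical dependency interface (illustrative only).
class Storage {
public:
    virtual ~Storage() = default;
    virtual std::string load(const std::string& key) = 0;
};

// Object under test: its dependency is injected through the constructor.
class Greeter {
public:
    explicit Greeter(Storage& storage) : storage_(storage) {}

    std::string greet(const std::string& user_key) {
        try {
            return "Hello, " + storage_.load(user_key) + "!";
        } catch (const std::runtime_error&) {
            return "Hello, stranger!";  // failure path
        }
    }

private:
    Storage& storage_;
};

// Hand-written mock: records calls and can simulate a failing dependency.
class MockStorage : public Storage {
public:
    std::string load(const std::string& key) override {
        ++calls;
        last_key = key;
        if (fail) {
            throw std::runtime_error("storage unavailable");
        }
        return value;
    }

    bool fail = false;
    std::string value = "Alice";
    int calls = 0;
    std::string last_key;
};

// Both execution paths are verified in isolation.
void test_greeter() {
    MockStorage ok;
    assert(Greeter(ok).greet("id-1") == "Hello, Alice!");
    assert(ok.calls == 1 && ok.last_key == "id-1");

    MockStorage broken;
    broken.fail = true;
    assert(Greeter(broken).greet("id-2") == "Hello, stranger!");
}
```

Because Greeter depends only on the Storage interface, the test runs in microseconds and the exception path is as easy to exercise as the happy path.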

However, don’t go to the other extreme - mocks returning mocks returning mocks are a sign of bad design. This post from Gary Bernhardt explains very well what ‘too many mocks’ means.

Unit test execution should be part of the build.

In my opinion, it’s important to run your tests as part of the build (and to keep the execution time short enough). If the total test execution takes hours, then nobody will bother running the tests before pushing new changes.

When tests take a long time, it is almost always (let’s exclude situations where your project has hundreds of millions of lines of code and you have millions of tests) because you need to create and then remove databases, or send a network request. This means you’re writing an integration test. It may be useful to have integration tests, but they should be an addition to the unit tests, not a replacement. Nevertheless, integration tests are not in the scope of this post.

Unlike in the past, everyone uses source code control these days. I have no doubt that eventually unit tests will be where source code control is now - everywhere. Start using them today (if you aren’t already) and your product quality will improve dramatically, allowing you to pull ahead of your competitors.

Style consistency

Do you like seeing spaces and tabs in the same file? Do you enjoy having CamelCase and underscore_style classes/methods/variables in the same project? What about methods which are two thousand lines long, half of which are commented out? Or ten empty lines between two lines of code?

I know people who don’t care about that, but I don’t really know anyone who enjoys it. Most people want the code to be consistent. Consistency makes code easier to read and maintain. If you’re using well-known code style practices (which you should), it will make the life of new developers much easier. Yet, it’s hard to be consistent even when you are the only one writing the whole project. I believe that the only way to achieve style consistency is to use a tool that analyzes all your files. It is important to run this tool as a part of your build and to treat its output as compilation errors. All the languages that I know have such tools (often there are many of them for one language): StyleCop for C#, Vera++ or Google CppLint for C++, Go comes with a style checker out of the box, etc.

Don’t worry if you have to customize or disable some rules. Every team/project has different preferences.

In my example project I used Google CppLint and I enabled usage of C++11 features, streams and references. It is also worth mentioning that if you’re from the C++ world, it’s a good idea to keep an eye on the Core Guidelines and the automation tools that come with them.

Conclusion

Chaos creates chaos. If from the start your project has multiple warnings, no tests, a tricky project structure and inconsistent style, all future commits will look the same. The probability of mistakes will be high, even if you hire top-notch developers to add a new feature.

Invest time in the project setup from the very beginning and always pay attention to the structure of great projects in order to learn from them.

Raw pointers in modern C++

Sat, 01 Oct 2016 03:42:00 +0000

- Hi. May I have 3 owning raw pointers and a couple of delete keywords, please?
- No, sir. Not anymore. Not in 2016.

The rule

There is almost no need to use owning raw pointers and the delete keyword in today’s C++. It is highly unlikely that your situation is exceptional unless you’re working on a brand new, super cool memory manager library, or another very specific low-level tool.

Owning vs non-owning

The word ‘owning’ is important. All problems described below are related to owning pointers. There is nothing wrong with non-owning raw pointers.

void foo(const SomeClass* const p1) {
    SomeClass* p2 = new SomeClass();
    // p1 - non-owning, p2 - owning
    // ...
}

What’s wrong with raw pointers

Raw pointers were an amazing invention in their time. They simplified development and pushed our industry forward. However, that time has passed, and today they create a lot of potential problems. Some of them are shown below. Let’s take a look at this simple snippet. Do you see the mistake?

void use_resource(const int n) {
    auto buff = new int[n * 1024];
    // do some operation with the buff
    delete buff;
}

First, we have the well-known memory leak problem. If an exception is thrown between allocation and deallocation, the memory owned by buff will not be released. And even if you never use exceptions and you’re sure that everyone who works with your code does the same, there is still a danger. For example, in time someone may add a validation check and return from the function before deallocation happens:

if (n > 1000) { // some validation
    return;
}

Another problem with the use_resource() snippet above is that I accidentally used delete instead of delete[] (we are people and people make mistakes sometimes, right?). And, guess what? Memory leak? It’s actually worse - we got undefined behaviour.

Now let’s try to be more defensive and wrap buff allocation and de-allocation into a class:

#include <cstdio>

class Resource {
public:
    Resource() {
        buff_ = new int[10];
        for (int i = 0; i < 10; ++i) {
            buff_[i] = i;
        }
    }

    void use() {
        // do some operation with the resource, print second element for example
        printf("%d\n", buff_[1]);
    }

    ~Resource() {
        delete[] buff_;
    }

private:
    int* buff_;
};

It’s better, right? Now, we can write:

void use_resource(const int n) {
    Resource buff;
    // do some operation with the buff
} // here buff destructor will be called automatically

The code became simpler and we don’t need to worry about memory leaks because the destructor (which will be called automatically when buff goes out of scope) will take care of everything. Unfortunately, this code still has problems. Imagine the following use case:

void add_resource(std::vector<Resource>& resources, bool save) {
    Resource new_resource;
    // do something with new resource
    new_resource.use();
    if (save) { // save it for future usage
        resources.push_back(new_resource);
    }
}

int main() {
    std::vector<Resource> resources;
    // create 2 resources and save both of them to the vector
    add_resource(resources, true);
    add_resource(resources, true);
    for (auto& r : resources) {
        r.use();
    }
}

Here, when we push a resource to the vector, a copy of our object is created. And, oops, I forgot to write a copy constructor, so the compiler generated one for me. The generated constructor performs a so-called shallow copy, which means it copies only the pointer and not the data. We now have two objects that point to the same data. As soon as one of them goes out of scope (at the end of add_resource()), the data is deleted, and our resources vector is left storing dangling pointers. So the line r.use() leads to undefined behavior, and so does the second delete[] of the same buffer when the vector is destroyed. If you’re lucky your program will crash, if not… well, anything can happen.
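If you really had to keep the raw buffer, one possible fix is to write the copy and move operations by hand. The sketch below is my own illustration rather than code from a real project: the constant kSize and the accessor second() are hypothetical names, and the assignment operator uses the copy-and-swap technique, which is one choice among several.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

class Resource {
public:
    Resource() : buff_(new int[kSize]) {
        for (std::size_t i = 0; i < kSize; ++i) {
            buff_[i] = static_cast<int>(i);
        }
    }

    // Deep copy: allocate a new buffer and copy the data, not just the pointer.
    Resource(const Resource& other) : buff_(new int[kSize]) {
        std::copy(other.buff_, other.buff_ + kSize, buff_);
    }

    // Move: take over the buffer and leave the source empty.
    Resource(Resource&& other) noexcept : buff_(other.buff_) {
        other.buff_ = nullptr;
    }

    // Copy-and-swap: the by-value parameter is copied or moved by the caller,
    // then its buffer is swapped in; the old buffer is freed with 'other'.
    Resource& operator=(Resource other) noexcept {
        std::swap(buff_, other.buff_);
        return *this;
    }

    ~Resource() {
        delete[] buff_;  // delete[] on nullptr is a no-op
    }

    // Hypothetical accessor; must not be called on a moved-from object.
    int second() const { return buff_[1]; }

private:
    static constexpr std::size_t kSize = 10;
    int* buff_;
};

void add_resource(std::vector<Resource>& resources, bool save) {
    Resource new_resource;
    if (save) {
        resources.push_back(new_resource);  // now copies the data itself
    }
}
```

That is a lot of code to get right for a ten-element buffer, which is exactly why a hand-written resource class should be the last option.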

Raw pointers have a lot of other disadvantages. See Scott Meyers’s book for details or do your own research.

Why do people still use them?

Legacy code is a separate topic. C++ is an old language and there are tons of projects written in it, so obviously they cannot be refactored quickly. Nevertheless, why are owning raw pointers still used so heavily in recent projects written in modern C++? The only answer I have is “bad habit”. There have been many talks, books and blogs (see below), but not everyone has unlearned the old stuff and learned the better way.

So… What to use instead?

In short - use the RAII idiom:

Allocate the object on the stack if possible
Use smart pointers
Only as a last resort - wrap your resource in its own class, which acquires it in the constructor and releases it in the destructor, but do not forget about the copy/move constructors and assignment operators, which may be generated for you
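The first two options can be sketched as follows. The function names are hypothetical and the bodies are deliberately trivial; the point is only that neither function contains a delete, and that an early return or an exception cannot leak the buffer.

```cpp
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

// A standard container owns the buffer; its destructor releases it on every
// exit path, including the early return below.
long long sum_with_vector(int n) {
    std::vector<int> buff(static_cast<std::size_t>(n) * 1024, 1);
    if (n > 1000) {  // some validation
        return -1;   // no leak: buff is destroyed here
    }
    return std::accumulate(buff.begin(), buff.end(), 0LL);
}

// std::unique_ptr<int[]> calls delete[] (not delete) automatically.
// Since C++14, std::make_unique<int[]>(10) avoids the explicit new as well.
int second_with_unique_ptr() {
    std::unique_ptr<int[]> buff(new int[10]);
    for (int i = 0; i < 10; ++i) {
        buff[i] = i;
    }
    return buff[1];
}

// Non-owning raw pointers remain fine: this function only observes the value
// and never frees it.
int value_or(const int* p, int fallback) {
    return p ? *p : fallback;
}
```

Unlike the hand-written class, std::vector copies its data correctly out of the box, and std::unique_ptr refuses to be copied at all, so the shallow-copy mistake becomes a compile error.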

A couple of links

Still not convinced? Take a look at amazing text/videos from C++ experts:

Herb Sutter’s blog post on modern C++
Bjarne Stroustrup’s talk at CppCon 2014. Jump to 30:50 for the Resource example
Scott Meyers’ Effective Modern C++ Book Chapter 4