Masscan is a fast network scanner that is good for scanning a large range of IP addresses and ports. We’ve adapted it to our needs by giving it a little tweak.
The biggest inconvenience in the original version was the inability to collect banners from HTTPS servers. And what is the modern web without HTTPS? You can’t really scan anything. That’s what motivated us to modify masscan. As usually happens, one small improvement led to another, and some bugs were discovered along the way. Now we want to share our work with the community. All the modifications we’ll be talking about are already available in our repository on GitHub.
What are network scanners for
Network scanners are among the universal tools of cybersecurity research. We use them for tasks such as perimeter analysis, vulnerability scanning, phishing and data leak detection, C&C detection, and host information collection.
How masscan works
Before we talk about the custom version, let’s understand how the original masscan works. If you are already familiar with it, you may be interested in the selection of useful scanner options. Or go straight to the section “Our modifications to masscan.”
The masscan project is small and, in our opinion, written scrupulously and logically. It was nice to see the abundance of comments — even deficiencies and kludges are clearly marked in the code:
Logically, the code can be divided into several parts as follows:
- implementation of application protocols
- implementation of the TCP stack
- packet processing and transmission threads
- implementation of output formats
- reading raw packets
Let’s look at some of them in more detail.
Implementation of application protocols
Masscan is built on a modular concept, so it can support any protocol. All you need to do is register the appropriate structure and specify its use everywhere it is needed (ha-ha):
Here’s a little description of the structure.
The protocol name and the standard port are informative only. The ctrl_flags field is not used anywhere.
The init function initializes the protocol, parse is the method responsible for processing the incoming data feed and generating response messages, and cleanup is the cleanup function for the connection.
The transmit_hello function is used to generate a hello packet if the server itself does not transmit something first, and the data from the hello field is used if the function is not specified.
A function that tests the protocol implementation can be specified in the selftest field.
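To make the roles of these fields easier to picture, here is a minimal Python sketch of such a handler record. It is not masscan’s actual C structure: the class name ProtocolHandler, the make_hello helper, the REGISTRY dictionary, and all function signatures are our assumptions for illustration, and only the field names come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical Python mirror of the fields described in the text.
# The real structure is a C struct inside masscan and may differ.
@dataclass
class ProtocolHandler:
    name: str                                   # informative only
    default_port: int                           # informative only
    parse: Callable[[dict, bytes], bytes]       # consume input, return reply bytes
    init: Optional[Callable[[], None]] = None   # one-time protocol setup
    cleanup: Optional[Callable[[dict], None]] = None  # per-connection teardown
    transmit_hello: Optional[Callable[[dict], bytes]] = None
    hello: bytes = b""                          # static hello, used as a fallback
    selftest: Optional[Callable[[], bool]] = None

    def make_hello(self, conn: dict) -> bytes:
        """Prefer transmit_hello; fall back to the static hello field."""
        if self.transmit_hello is not None:
            return self.transmit_hello(conn)
        return self.hello


def http_parse(conn: dict, data: bytes) -> bytes:
    # Toy parser: remember the first line as the banner, send nothing back.
    conn.setdefault("banner", data.split(b"\r\n", 1)[0])
    return b""


HTTP = ProtocolHandler(
    name="http",
    default_port=80,
    parse=http_parse,
    hello=b"GET / HTTP/1.0\r\n\r\n",
)

# "Registering" a protocol is modeled here as adding it to a dictionary.
REGISTRY = {HTTP.name: HTTP}
```

In this sketch a protocol without transmit_hello simply sends the bytes from hello, which matches the fallback behavior described above.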
Through this mechanism, for example, it’s possible to write handlers in Lua (the --script option). However, we never got around to checking whether it really works. What we ran into with masscan is that most of the interesting options are not described in the documentation, and the documentation itself is scattered across different places and partially overlaps. Some flags can only be found in the source code (main-conf.c). The --script option is one of them, and we have collected some other useful and interesting options in the section "Useful options of the original masscan."
Implementation of the TCP stack
One of the reasons why masscan is so fast and can handle many simultaneous connections is its native implementation of the TCP stack*. It takes about 1,000 lines of code in the file proto-tcp.c.
* A native TCP stack lets you bypass OS restrictions, avoid consuming OS resources and relying on heavier OS mechanisms, and shorten the packet processing path
Packet processing and transmission threads
Masscan is fast and essentially single-threaded. More precisely, it uses two threads per network interface, one of which processes incoming packets. But hardly anyone runs it on more than one interface at a time.
One thread:
- reads raw data from the network interface.
- processes this data by running it through its own TCP stack and application protocol handlers.
- prepares the data to be transmitted.
- puts it in the transmit_queue.
The other thread takes the messages prepared for transmission from transmit_queue and writes them to the network interface (Fig. 1). If the messages sent from the queue do not use up the sending limit, SYN packets are generated and sent to the next scan targets.
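The two-thread scheme can be sketched roughly as follows. This is a simplified Python model, not masscan’s code: the function names, the batch_limit parameter, the None sentinel, and the list that stands in for the network interface are all assumptions made for the example.

```python
import queue
import threading


def receive_loop(raw_packets, transmit_queue):
    """Read raw data, process it, and queue replies for transmission."""
    for raw in raw_packets:
        reply = b"ACK:" + raw      # placeholder for TCP stack + protocol handlers
        transmit_queue.put(reply)
    transmit_queue.put(None)       # sentinel: nothing more to process


def transmit_loop(transmit_queue, targets, wire, batch_limit=2):
    """Send queued replies first; spend what is left of the limit on new SYNs."""
    targets = iter(targets)
    done = False
    while not done:
        budget = batch_limit
        while budget > 0:
            try:
                msg = transmit_queue.get(timeout=0.05)
            except queue.Empty:
                break
            if msg is None:
                done = True
                break
            wire.append(msg)
            budget -= 1
        # The queue did not exhaust the limit: start scanning new targets.
        for _ in range(budget):
            target = next(targets, None)
            if target is None:
                break
            wire.append(b"SYN:" + target)


def run(raw_packets, targets):
    transmit_queue = queue.Queue()
    wire = []                      # stands in for the network interface
    rx = threading.Thread(target=receive_loop, args=(raw_packets, transmit_queue))
    tx = threading.Thread(target=transmit_loop, args=(transmit_queue, targets, wire))
    rx.start()
    tx.start()
    rx.join()
    tx.join()
    return wire
```

Replies to live connections get priority here, and new SYN packets only fill whatever part of the per-batch limit is left over, which is the behavior the paragraph above describes.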
Implementation of output formats
This part is conceptually similar to the modular implementation of protocols: it also has the OutputType structure that contains the main serialization functions. There’s an abundance of all possible output formats: custom binary, the modern NDJSON, the nasty XML, and the grepable format. There’s even the option of saving data to Redis. Let us know in the comments if you’ve tried it :)
Some formats are compatible with (or, as the author of masscan puts it, inspired by) similar utilities, such as nmap and unicornscan.
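As an illustration of why a line-by-line format is convenient, here is a small Python sketch that writes and reads NDJSON (one JSON object per line). The record fields (ip, port, proto, banner) are made up for the example and do not reproduce masscan’s real output schema.

```python
import io
import json


def write_ndjson(records, stream):
    """Serialize each record as one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_ndjson(stream):
    """Yield records one at a time without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records; the field names are assumptions for this example.
records = [
    {"ip": "192.0.2.1", "port": 443, "proto": "tcp", "banner": "HTTP/1.1 200 OK"},
    {"ip": "192.0.2.2", "port": 80, "proto": "tcp", "banner": "HTTP/1.1 404 Not Found"},
]

buf = io.StringIO()
write_ndjson(records, buf)
buf.seek(0)

# Stream through the output and keep only hosts with port 443 open.
open_443 = [r["ip"] for r in read_ndjson(buf) if r["port"] == 443]
```

Because every line is a complete document, a consumer can process gigabytes of output as a stream, or even tail a file that is still being written.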
Reading raw packets
Masscan can work with the network adapter through the PCAP or PFRING libraries and can read data from a PCAP dump. The rawsock.c file contains several functions that abstract the main code from specific interfaces.
To select PFRING, use the --pfring parameter; to read from a dump, add the file prefix to the adapter name.
Useful options of the original masscan
Let’s take a look at some interesting and useful options of the original masscan that are rarely talked about.
Options: --nmap, --help
Description: Help
Comment: Even combined, these options give very little useful information. The documentation is also incomplete and scattered across different files: README.md, man, FAQ. There’s also a small HOWTO on how to use the scanner together with AFL (american fuzzy lop). If you want to know about all the options, you can find the full list of them only in the source code (main-conf.c).

Options: --output-format ndjson, -oD, --ndjson-status
Description: NDJSON support
Comment: Gigabytes of line-by-line NDJSON files are much nicer to handle than JSON. And the status output in NDJSON format is useful for writing utilities that monitor masscan performance.

Options: --output-format redis
Description: Ability to save outputs directly to Redis
Comment: Well, why not? :) If you haven’t worked with this tool, read about it here.

Options: --range fe80::/67
Description: IPv6 support
Comment: Everything’s clear here, but it would be interesting to read about real use cases in the comments. I can think of scanning a local network or only a small range of some particular country obtained through BGP.

Options: --http-*
Description: HTTP request customization
Comment: When creating an HTTP request, you can change any part of it to suit your needs: method, URI, version, headers, and/or body.

Options: --hello-[http, ssl, smbv1]
Description: Scanning protocols on non-standard ports
Comment: If masscan hasn’t received a hello packet from the target, its default setting is to send the request first, choosing a protocol based on the target’s port. But sometimes you might want to scan HTTP on some non-standard port.

Options: --resume
Description: Pause
Comment: Masscan knows how to delicately stop and resume where it paused. With Ctrl+C (SIGINT) masscan terminates, saving state and startup parameters, and with --resume it reads that data and continues operation.

Options: --rotate-size
Description: Rotation of the output file
Comment: The output can contain a lot of data, and this parameter allows you to specify the maximum file size at which the output will start to be written to the next file.

Options: --shard
Description: Horizontal scaling
Comment: Masscan pseudorandomly selects targets from the scanned range. If you want to run masscan on multiple machines within the same range, you can use this parameter to achieve the same random distribution even between machines.

Options: --top-ports
Description: Scanning of N popular ports (array top_tcp_ports)
Comment: This parameter came from nmap.

Options: --script
Description: Lua scripts
Comment: I have doubts that it works, but the possibility itself is interesting. Is there anyone who uses it? Let me know if you have any interesting examples.

Options: --vuln [heartbleed, ticketbleed, poodle, ntp-monlist]
Description: Search for certain known vulnerabilities
Comment: We cannot say anything about its correctness and efficiency, since this mechanism of vulnerability detection is a kind of kludge scattered throughout the code and conflicts with many other options, and we did not have to apply it in real tasks.
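The idea behind --shard can be illustrated with a short Python sketch. It is a conceptual model under our own assumptions, not masscan’s algorithm: random.Random(seed).shuffle stands in for the scanner’s own pseudorandom target ordering, the function name shard_targets is hypothetical, and the sketch assumes every machine uses the same seed so that the shards partition the range.

```python
import random


def shard_targets(targets, shard, total_shards, seed):
    """Return the targets handled by shard `shard` (1-based) of `total_shards`."""
    if not 1 <= shard <= total_shards:
        raise ValueError("shard must be between 1 and total_shards")
    order = list(targets)
    # Stand-in for the scanner's pseudorandom ordering; every machine must
    # use the same seed to obtain the same permutation.
    random.Random(seed).shuffle(order)
    # Each machine takes every total_shards-th element, offset by its number.
    return order[shard - 1::total_shards]


targets = [f"10.0.0.{i}:80" for i in range(10)]
parts = [shard_targets(targets, k, 3, seed=42) for k in (1, 2, 3)]
```

Each machine walks the same shuffled sequence with a different offset, so the shards do not overlap, and each one still looks like a random sample of the whole range.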
Just to remind you of an important point everyone stumbles upon: masscan probably won’t work if you just run it to collect banners. The documentation does say this, but who cares to read it, right? Since masscan uses its own network stack, the OS knows nothing about the connections it creates and is rather surprised when it receives a (SYN, ACK) packet from somewhere on the network in response to the scanner’s SYN request. Then, depending on the type and settings of the OS and firewall, the OS sends an ICMP or RST packet, which ruins the results. So you need to read the documentation and take this point into account.
Our modifications to masscan
We’ve added HTTPS support
The Internet is quite the fortress these days: even the most backward scammers have already given up on unencrypted HTTP. Working without HTTPS support is therefore rather inconvenient, since this feature makes investigations, such as searching for C&C servers and phishing, much easier. There are other tools besides masscan, but they are slower. We wanted a universal tool that would cover HTTPS and still be fast.
The first thing to do was to implement full-fledged SSL support. The original masscan can only send a predefined hello packet and then fetch and process the server certificate. Our version can establish and maintain an SSL connection and analyze the contents of nested protocols, which means it can collect HTTP banners from HTTPS servers.
Here’s how we achieved that. We added a new application-layer protocol to the source code and used the standard solution, OpenSSL, to implement SSL. Here we needed to do some fine-tuning, and the structure describing the application-layer protocol in the custom scanner looks like this:
We added handlers for protocol deinitialization and connection initiation, and expanded the set of handler parameters. As a result, it became possible to handle nested protocols. We also made switching the application protocol handler more precise. This is necessary when the data cannot be processed with the current protocol or when such a mechanism is built into the protocol itself, for example, when using STARTTLS.
Then we had some problems with performance and packet loss. SSL is heavy on the CPU. We had the option to try something faster than OpenSSL, but we went in the direction of processing incoming packets in several threads within one network interface. After implementing this, the packet processing pipeline looks like this:
The th_recv_read thread reads data from the network interface regardless of the data processing speed. The q_recv_pb queue helps to detect cases when the transmission rate is too high and inbound packets cannot be processed in time. The th_recv_sched thread dispatches messages to the th_recv_hdl_* threads based on hashes of the outbound and inbound IP addresses and ports, so that packets of the same connection always land in the same handler. The options related to this functionality are --num-handle-threads (the number of handler threads) and --tranquility (automatic reduction of the packet transmission speed when inbound packets cannot be handled fast enough).
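The dispatching rule can be sketched in a few lines of Python. This is only an illustration of the principle: the function name handler_index and the use of CRC32 are our assumptions, and the hash function in the actual code may be different.

```python
import ipaddress
import struct
import zlib


def handler_index(src_ip, src_port, dst_ip, dst_port, num_handlers):
    """Map a connection's addresses and ports to one handler thread."""
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )
    # CRC32 is used here only because it is deterministic and in the
    # standard library; any stable hash would serve the same purpose.
    return zlib.crc32(key) % num_handlers


# Every packet of this connection is routed to the same handler.
first = handler_index("198.51.100.7", 443, "192.0.2.10", 40000, 4)
second = handler_index("198.51.100.7", 443, "192.0.2.10", 40000, 4)
```

Since the index depends only on the addresses and ports, the state of a connection is touched by a single handler thread, so the handlers do not need to share per-connection state with each other.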
HTTPS support is enabled with the --dynamic-ssl parameter, while --output-filename-ssl-keys can be used to save master keys.
You can also notice a small cosmetic improvement — namely, the names of the threads. In our version, it became clear which threads consume resources:
We’ve improved code quality
We found many oddities and errors in masscan. For example, the conversion of time to ticks** looked as follows:
** A unit of time that is precise enough and does not take up too much space
Network TCP connections were often handled incorrectly, resulting in broken connections and unnecessary repeat transmissions:
We also discovered errors in memory handling, including memory leaks. We managed to fix many of them, but not all. For example, when scanning /0:80, we still see several leaked blocks of 2 bytes each.
These errors were detected thanks to our colleagues, who meticulously used our work, as well as static analyzers (GCC, Clang, and VS) and the UB and memory sanitizers. Separately, I want to thank PVS-Studio. Those guys are unparalleled in quality and convenience.
We’ve added a build for different OSs
To lock in these results, we’ve set up builds and tests for Windows, Linux, and macOS using GitHub Actions.
The build pipeline looks like this (Fig. 4):
- format check
- Clang static analyzer check
- debug build with sanitizers and a run of the built-in tests
- build and submission of data to the SonarCloud and CodeQL services
You can download compiled binaries from the build or release artifacts:
We’ve added a few more features
Here are the rest of the less significant things that were introduced in our version:
- --regex (--regex-only-banners) is data-level message filtering in TCP. A regular expression is applied to the contents of each TCP packet. If the regular expression matches, the connection information will be in the output.
- --dynamic-set-host adds the Host header to an HTTP request. The IP address of the target being scanned is taken as its value.
- Output of internal signature triggers on masscan protocols.
- An option to specify URIs in HTTP requests. We removed it later because the author of the original masscan added the same functionality. This is part of the --http-* options family.
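The --regex filtering described in the list above can be modeled with a short Python sketch. It only illustrates the principle: the function name matching_connections and the (connection, payload) input format are assumptions, and Python’s re module is not necessarily compatible with the regular expression syntax the scanner accepts.

```python
import re


def matching_connections(packets, pattern):
    """Return connections with at least one payload matching the pattern.

    `packets` is an iterable of (connection, payload) pairs, where the
    connection is any hashable identifier and the payload is bytes.
    """
    regex = re.compile(pattern)
    seen = set()
    hits = []
    for conn, payload in packets:
        # The expression is applied to each packet's payload separately.
        if conn not in seen and regex.search(payload):
            seen.add(conn)
            hits.append(conn)
    return hits


packets = [
    (("192.0.2.1", 80), b"HTTP/1.1 200 OK\r\nServer: nginx/1.18.0\r\n\r\n"),
    (("192.0.2.2", 80), b"HTTP/1.1 200 OK\r\nServer: Apache\r\n\r\n"),
    (("192.0.2.1", 80), b"<html>nginx default page</html>"),
]
nginx_hosts = matching_connections(packets, rb"(?i)server: nginx")
```

Note that in this model a match that is split across two packets is not found, because each payload is checked on its own.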