Why Collect Information?#
Information gathering is one of the most important stages of penetration testing. Its purpose is to obtain accurate information about the target, understand how the target organization operates, and determine the best attack route, all done quietly, without letting the other party detect your presence or infer your intentions. Before starting a penetration test, you need to collect basic information about the target host. ==The more information you obtain, the higher the probability of a successful penetration test==.
Classification of Information Gathering#
- Passive Information Gathering: ==accessing the target through third-party services== such as Google search, Shodan, and other aggregation tools. Passive gathering means collecting as much target-related information as possible without interacting with the target directly.
- Active Information Gathering: ==directly scanning the target host or website==. Active methods obtain more information, but the target system may log your activity.
What Information Should Be Collected?#
IP Resources | Server Information | Website Information | Human Resources |
---|---|---|---|
Real IP | Operating system type and version | CMS | Domain owner, registrar |
Side site information | Open ports | WAF | Phone numbers |
C-class hosts | | Web middleware | |
 | | Development language | Various private details |
 | | Database | |
 | | API, specific files | |
Information Gathering Methods#
1. Real IP#
01. Determine if it is a Real IP#
Before talking about real IPs, a brief introduction to CDN technology: CDN stands for ==Content Delivery Network==. To keep transmission stable and fast, website service providers deploy node servers at different locations on the network, and CDN technology distributes each request to the optimal node server.
- Online Website Query
Website tools: http://ping.chinaz.com/
Aizhan: https://ping.aizhan.com/
==If there are multiple different response IPs, it indicates that there may be a CDN==.
- nslookup
If the domain resolves to multiple IP addresses, it is likely using a CDN.
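A minimal sketch of this check using dig (from the dnsutils package; example.com is a placeholder target): if different public resolvers return different answer sets, a CDN is likely in play.

```bash
# Resolve the target via several public resolvers; if each resolver
# returns a different set of IPs, the site is probably behind a CDN.
for ns in 8.8.8.8 1.1.1.1 114.114.114.114; do
  echo "== resolver $ns =="
  dig +short @"$ns" example.com
done
```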
02. How to Find the Real IP (Bypassing CDN)#
1. Look for Subdomain IPs#
Subdomains may be on the same server or in the same C-class network as the main site. By querying the IP information of subdomains, you can assist in determining the real IP information of the main site.
See below ==4. Subdomain Information Collection==.
2. Check Historical DNS Resolution Information#
Check the historical records of IP and domain bindings. There may be records from ==before the CDN was adopted==. Then ==work out which historical IPs are not among the current CDN resolution IPs==; those ==may be the real IP without CDN acceleration==.
- viewdns.info: a DNS history site that records changes over the years.
- securitytrails.com: a large DNS history database. In my tests it could find the IPs and hosting locations a website has used over the years, which is quite alarming. (Registration required.)
Syntax: domain:baihe.com type:A
Or just enter the website domain in the search field and press Enter; "Historical Data" is then available in the left-hand menu.
- Cloudflare's Advice
==A, AAAA, CNAME, or MX records pointing to your origin will expose your origin IP.==
So check the DNS resolution records for the domain.
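A minimal sketch of that check with dig (example.com is a placeholder):

```bash
# Dump the record types Cloudflare warns about; any of them pointing
# straight at the origin server leaks its IP.
for type in A AAAA CNAME MX; do
  echo "== $type records =="
  dig +short example.com "$type"
done
```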
3. Use Foreign Hosts for Direct Detection#
Another method: if you don't have hosts abroad, use a public multi-location ping service. These services have detection nodes abroad, and ==the ICMP response information returned from the foreign nodes== can be used to determine the real IP, since some CDN deployments have few or no foreign nodes and requests from abroad may hit the origin directly.
- Foreign node ping addresses
http://www.webkaka.com/Ping.aspx
4. Check the Email Server IP from Emails Received#
- RSS email subscriptions: many websites run sendmail and will send us emails, and the raw source of those emails includes the server's real IP.
- If the target system has a mail function, it usually sends emails on user registration, password recovery, and so on. Viewing the original source of an email sent by the system reveals the sender's IP address (a grep sketch follows this list).
- DNS's MX records (see point 2 above).
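For the email-source checks above, here is a minimal sketch, assuming the raw message has been saved as mail.eml (a placeholder filename):

```bash
# The bottom-most Received: header is the first hop, i.e. the server
# that actually sent the mail; X-Originating-IP, when present, is even
# more direct.
grep -iE '^(Received|X-Originating-IP):' mail.eml
```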
5. Certificate Query#
The principle: send a Client Hello to port 443 of the IP. The server replies with a Server Hello carrying its SSL certificate, and ==the common name in the SSL certificate contains domain information==. This tells you which domain resolves to that IP. Put more precisely, an IP's port 443 may expose the domain.
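A quick way to test a candidate IP, sketched with the openssl command-line tool (1.2.3.4 is a placeholder):

```bash
# Speak TLS to the candidate IP's port 443 and print the certificate
# subject; a common name matching the target domain ties the IP to it.
echo | openssl s_client -connect 1.2.3.4:443 2>/dev/null \
  | openssl x509 -noout -subject
```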
https://search.censys.io/# Check historical certificates.
Syntax:
parsed.names: 4399.com and tags.raw: trusted
Only show valid certificate query parameters: tags.raw: trusted

Censys lists all standard certificates matching the search criteria, found during its scans. Click any certificate to view its details.

6. Use zmap to Capture Target IP Segment 80 Banner Information#
Randomly scan 10,000 IPs on port 80.
zmap -B 10M -p 80 -n 10000 -o results.csv
Loop through the obtained IPs and use curl to print out the banner.
for i in `zmap -B 10M -p 80 -n 10000`; do curl -s -I "$i" >> out1; done
Then look for an IP whose banner ==matches the target domain's banner==; that IP is likely the real IP.
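For the matching step, one approach (a sketch; the Server value shown is a placeholder) is to record the banner the target serves through the CDN and grep the collected output for it:

```bash
# Record the banner the target serves through the CDN...
curl -sI https://www.example.com | grep -i '^server:'
# ...then search the collected results for the same banner string
# ('nginx/1.18.0' is a placeholder value).
grep -i 'nginx/1.18.0' out1
```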
7. Domain Tweaking#
In the past, a common habit when deploying a CDN was to put only the www host behind it, leaving the naked (apex) domain unproxied so the site could be maintained without waiting for the CDN cache. So try removing the www from the target domain and ping it to see whether the IP changes.
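A sketch of that comparison with dig (example.com as a placeholder):

```bash
# If the apex (naked) domain resolves differently from the www host,
# the apex answer may be the unproxied origin IP.
dig +short www.example.com
dig +short example.com
```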
8. Social Engineering#
If you have obtained the administrator's account for the target website's CDN, you can find the website's real IP in the CDN configuration.
2. Side Site Information Collection#
Side sites are other websites hosted on the ==same server as the attack target==. When the target itself has no exploitable vulnerabilities, you can look for vulnerabilities in a side site, compromise it, and then escalate privileges to gain the highest permissions on the server.
- nmap port scanning
nmap -sV -p- real_ip -v -oN xxx.txt
- Online query websites
3. C-Class Information Collection#
C-class hosts are servers ==in the same C-class network (the same /24) as the target server==. The live hosts in the target's C-class range are important reconnaissance information; many internal servers of organizations and enterprises sit in the same C-class network.
- nmap
nmap -sn real_IP/24 -v -oN xxx.txt
-sn: ping scan; discover live hosts without port scanning.
Adding -n (no DNS resolution) tells Nmap never to perform reverse DNS resolution on the active IP addresses it discovers. Since DNS is generally slow, this speeds things up.
- Use Google, syntax: site:125.125.125.*
4. Subdomain Information Collection#
01. Subdomain Bruteforce Tools#
- ==OneForAll== https://github.com/shmilylty/OneForAll
A Python tool; OneForAll requires Python 3.6.0 or higher to run. With default parameters it writes its results to the results directory. Install the dependencies before use: pip install -r requirements.txt.
python3 oneforall.py --target example.com run
python3 oneforall.py --targets ./example.txt run
- JSFinder: see ==8.06== below.
- ESD (Download from GitHub, but I encountered errors using it).
# Scan a single domain
esd -d qq.com
- subfinder (Download from GitHub, requires Go language).
subfinder -d hackerone.com
Used together with httpx (also written in Go), it can find running HTTP servers:
echo 4399.com | subfinder -silent | httpx -ip > subdomain_list
The -silent flag (supported by both tools) suppresses the banner so that only domains are output.

02. Online Query Websites#
- ==Use search engines to discover subdomains==
Baidu: site:baidu.com
Google: site:baidu.com
- https://fofa.info/
- https://www.shodan.io/
- https://x.threatbook.com/v5/mapping
- https://www.dnsdb.io/zh-cn/ (useful, but heavy use requires membership)
Input: baidu.com type:A

5. Determine Operating System Type and Version#
- nmap
nmap -O 192.168.88.21
- Check whether the website's URLs are case-sensitive (case-insensitive suggests Windows, case-sensitive suggests Linux).
- The TTL value in ping replies: Windows is generally 128 (or >100), while Linux is 64.
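Both checks can be run from a shell; a minimal sketch, with example.com as a placeholder target:

```bash
# TTL check: ~64 suggests Linux, ~128 suggests Windows (minus the hops
# between you and the target).
ping -c 1 example.com | grep -io 'ttl=[0-9]*'

# Case-sensitivity check: identical responses for both spellings
# suggest a case-insensitive (Windows) filesystem.
curl -s -o /dev/null -w '%{http_code}\n' http://example.com/index.php
curl -s -o /dev/null -w '%{http_code}\n' http://example.com/INDEX.PHP
```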
6. Website Owner Information Collection#
Helpful for dictionary creation.
01. whois#
Whois (pronounced "who is"; it is not an acronym) is a protocol used to query the IP and owner information of a domain. In simple terms, whois is a database used to check whether a domain has been registered and to look up a registered domain's details (such as the ==domain owner and domain registrar==). Early whois queries were mostly done with command-line tools; web front-ends later appeared that simplify online lookups and can query multiple databases at once. The web tools still rely on the whois protocol to send queries to the servers, and the command-line tools remain widely used by system administrators. Whois typically uses TCP port 43.
==The WHOIS information for each domain or IP is maintained by the corresponding management organization==. For example, the WHOIS information for .com domains is managed by the .com domain operator VeriSign, while the national top-level domain .cn in China is managed by CNNIC.
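As a sketch of the protocol itself (example.com is a placeholder; whois.verisign-grs.com is the registry WHOIS server for .com):

```bash
# The usual command-line client:
whois example.com

# The same query done by hand over TCP port 43, as described above:
echo example.com | nc whois.verisign-grs.com 43
```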
02. Social Engineering#
Assuming we have obtained information through the target's colleagues, such as the target's real name, contact information, work hours, etc. *A skilled social engineer will organize, classify, and filter the information to construct a carefully prepared trap, allowing the target to walk into it.*
03. Personal Information Retained by Official Websites#
Generally, companies will place official contact information on their official websites, which can be used to collect email and phone information.
04. Recruitment Information Collection#
Recruitment postings on job websites contain a lot of personnel-related information, such as the email addresses and phone numbers of recruiting staff, while job seekers' resumes hold very detailed personal information: names, phone numbers, emails, and work experience. If the recruitment website has security vulnerabilities, those resumes may be leaked.
05. ICP Filing Information#
Reveals company information and the filing review time.
https://beian.miit.gov.cn/#/Integrated/index
06. Exposed Locations#
The same effect as point 2 above.
- View individual certificate information.
07. Check Company Information#
08. Obtain Email Information#
09. Others#
(1) Look for usernames directly on the web (as they generally have emails, you can get usernames based on company names or numbers to generate corresponding dictionaries).
(2) Use Google syntax to search for xlsx, etc., or directly search for this company-related information, which may reveal usernames.
(3) Check GitHub for this company to see if there are any leaks.
(4) Look for interviewers on job websites, as they may leak phone numbers and usernames, and check usernames based on phone numbers.
(5) Search for the company's organizational chart and note down any leaders.
(6) Use public accounts, Weibo, and other social media to search for company information.
(7) Use Baidu Images (this depends on luck: web searches sometimes return too many results, while Baidu Images can reveal usernames quickly. I thought of this when I needed to find a number during a previous attack-defense exercise, though that number was too blurry to read).
(8) Look for commonly used username dictionaries for collection.
7. Identify CMS#
A Content Management System (CMS) is a system for managing website content. CMSes offer many ==excellent template-based designs==, ==which speed up website development and reduce its cost==. A CMS is not limited to text; it can also manage images, Flash animations, audio and video streams, graphics, and even email archives. "CMS" is in fact a broad term that covers everything from general blog and news-publishing programs to comprehensive website management programs.
01. Manual Identification#
- ==The page footer may expose the CMS==
powered by ...
- ==robots.txt file==
Specific paths in it can identify the CMS.
- ==Response header information==
For example, the cookie fields.
- ==Website backend==
The website's backend login page also carries CMS-specific markup.
- URL routing, such as wp-admin (WordPress).
02. Fingerprint Recognition Tools#
The main development idea: send a request and establish a connection -> fetch the page content -> match keywords with regular expressions -> identify the CMS type.
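A minimal sketch of this idea in shell (placeholder URL; the keyword patterns are illustrative only):

```bash
# Fetch the homepage and match a few well-known CMS keywords.
body=$(curl -s http://example.com/)
case "$body" in
  *wp-content*)          echo "WordPress" ;;
  *"Powered by Discuz"*) echo "Discuz!" ;;
  */media/jui/*)         echo "Joomla" ;;
  *)                     echo "CMS not identified" ;;
esac
```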
- ==Chrome Extension -- Wappalyzer==

- Common tools include CMSeeK.
03. Online CMS Recognition Websites#
http://whatweb.bugscaner.com/look/
8. Identify Web Middleware#
- Response headers.
- Determine based on error messages.
- Determine based on default pages.
9. Internet Asset Collection#
Includes historical vulnerability information, GitHub source code leaks, SVN source code information, leaked cloud disk file information, etc.
01. Historical Vulnerability Information#
Google search for relevant software vulnerabilities.
02. GitHub Source Code Information Leaks#
GitHub is a hosting platform for open-source and private software projects, and many people like to upload their code to the platform. ==Attackers can search using keywords== to find ==sensitive information about the target site==, and even download the website source code.
When developers use git for version control, initializing a repository creates a hidden .git folder in that directory, containing all versions and a range of repository metadata. ==If the server leaves the .git folder inside the web directory==, attackers may be able to recover the application's entire source code from the information it holds.
- GitHub syntax search
Qualifier | Example | Meaning |
---|---|---|
in:name | vue in:name | Matches repositories with "vue" in their name. |
in:description | vue in:name,description | Matches repositories with "vue" in their name or description. |
in:readme | vue in:readme | Matches repositories that mention "vue" in their readme. |
repo:owner/name | repo:biaochenxuying/blog | Matches a specific repository, here the blog project of user biaochenxuying. |
For more details on the search syntax, see
https://github.com/FrontEndGitHub/FrontEndGitHub/issues/4
- GitHack, to pull the source code: a `.git` folder disclosure exploit.
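Before pulling the repository with GitHack, a quick probe shows whether .git is exposed at all (a sketch; example.com is a placeholder):

```bash
# HTTP 200 on .git/HEAD or .git/config suggests the repository can be
# reconstructed from the web root.
for p in .git/HEAD .git/config; do
  curl -s -o /dev/null -w "$p -> %{http_code}\n" "http://example.com/$p"
done
```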
03. Backup Site Compressed Packages#
Attempt to obtain through directory scanning.
04. SVN#
The .svn/entries file can be used to obtain server source code, SVN server account passwords, and other information. A more serious issue is that the .svn directory generated by SVN also contains copies of source files ending in .svn-base (in older SVN versions under the text-base directory, in newer versions under the pristine directory). If the server does not parse that suffix, attackers can download the source files directly.
Details: https://cloud.tencent.com/developer/article/1376492
- Source code restoration tool: SvnExploit, which supports dumping source code from SVN leaks across all SVN versions.
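An analogous quick probe for SVN leaks (a sketch; example.com is a placeholder; wc.db is the metadata store in SVN 1.7+, entries the older format):

```bash
# HTTP 200 on either path suggests the .svn metadata is downloadable.
for p in .svn/entries .svn/wc.db; do
  curl -s -o /dev/null -w "$p -> %{http_code}\n" "http://example.com/$p"
done
```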
05. DNS Information Leaks#
- MX record leaks
06. API Leaks#
07. Other Sensitive Files#
First check which CMS is being used, and then scan according to that CMS's directory structure.
If no CMS is used, use conventional sensitive file name dictionaries for scanning, such as:
- robots.txt
- crossdomain.xml
- sitemap.xml
- xx.tar.gz
- xx.bak
- phpinfo
08. Cloud Disk Search#
- Lingfengyun Search
- Xiaobaipan Search: https://www.xiaobaipan.com/
- Dali Pan Search: https://www.dalipan.com/
- Xiaobudian Search (Weipan): https://www.xiaoso.net/
- Baidu cloud disk crawling open-source tool: https://github.com/gudegg/yunSpider
09. Information Leaks Related to Vulnerabilities#
Google search for relevant middleware information leaks.
10. WAF Identification#
WAF Functions
A WAF can often be identified from the block page it returns; example screenshots are collected at
https://blog.csdn.net/weixin_46676743/article/details/112245605
Tools
- WAFW00f
WAFW00F allows one to identify and fingerprint Web Application Firewall (WAF) products protecting a website.
Alternatively, manually submit malformed URIs, SQL injection strings, or XSS payloads and see whether a WAF alert or block page is triggered (see the sketch after this list).
- nmap -p 80 --script http-waf-detect.nse 4399.com
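A sketch of that manual probe (placeholder URL; exact status codes vary by WAF product):

```bash
# A benign request vs. an obviously malicious-looking parameter; a
# 403/406 or a vendor block page on the second request points to a WAF.
curl -s -o /dev/null -w '%{http_code}\n' 'http://example.com/?id=1'
curl -s -o /dev/null -w '%{http_code}\n' "http://example.com/?id=1' or '1'='1"
```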
11. Port Scanning#
Method for scanning all ports.
nmap is slow.
nmap -sV -Pn -p- 1.1.1.1 -oX result.xml
masscan is fast but sometimes inaccurate.
masscan --open --banners -p- 1.1.1.1 --rate 1000 -oX result.xml
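A common compromise is to combine the two: masscan for fast port discovery, then nmap -sV against only the ports found (a sketch; the port list shown is a placeholder to fill in from masscan's output):

```bash
# Fast discovery with masscan, written to a list file...
masscan --open -p- 1.1.1.1 --rate 1000 -oL ports.txt
# ...then accurate service detection with nmap on only the ports found
# (22,80,443 is a placeholder list taken from ports.txt).
nmap -sV -Pn -p 22,80,443 1.1.1.1
```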
Common Port Vulnerability Information Table
Port Number | Service | Attack Methods |
---|---|---|
21/22/69 | ftp/tftp | Brute force, sniffing, overflow, backdoor |
22 | ssh | Brute force, 28 backspaces |
23 | telnet | Brute force, sniffing |
25 | smtp | Email forgery, brute force |
53 | dns | DNS zone transfer, DNS hijacking, DNS cache poisoning, DNS spoofing, DNS tunneling |
67/68 | dhcp | Hijacking, spoofing |
110 | pop3 | Brute force |
139 | samba | Brute force, unauthorized, remote code execution |
143 | imap | Brute force |
161 | snmp | Brute force |
389 | ldap | Injection, unauthorized |
512/513/514 | linux r | Directly use rlogin |
873 | rsync | Unauthorized |
1080 | socks proxy | Brute force, intranet penetration |
1352 | lotus | Brute force, weak passwords, information leakage (source code) |
1433 | mssql | Brute force, injection |
1521 | oracle | Brute force, injection, TNS remote poisoning |
2049 | nfs | Misconfiguration |
2181 | zookeeper | Unauthorized |
3306 | mysql | Brute force, injection, denial of service |
3389 | rdp | Brute force, shift (sticky keys) backdoor |
4848 | glassfish | Brute force, console weak passwords, authentication bypass |
5000 | sybase/db2 | Brute force, injection |
5432 | postgreSQL | Brute force, weak passwords, injection, buffer overflow |
5632 | pcanywhere | Denial of service, code execution |
6379 | redis | Unauthorized, brute force, weak passwords |
7001 | weblogic | Deserialization, console weak passwords, console deployment webshell |
8069 | zabbix | Remote command execution |
8080-8090 | web | Common web attacks, brute force, middleware vulnerabilities, CMS version vulnerabilities |
9090 | websphere | Brute force, console weak passwords, deserialization |
9200/9300 | elasticsearch | Remote code execution |
11211 | memcached | Unauthorized |
27017 | mongoDB | Brute force, unauthorized |
12. Directory Scanning#
- gobuster: https://github.com/OJ/gobuster
- dirsearch: https://github.com/maurosoria/dirsearch
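Typical invocations, as a sketch (the wordlist path and extension list are placeholders):

```bash
# gobuster in directory mode with a wordlist:
gobuster dir -u http://example.com -w /usr/share/wordlists/dirb/common.txt
# dirsearch with common web extensions:
python3 dirsearch.py -u http://example.com -e php,html
```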
13. APP Information Collection#
14. Other Information Collection Channels#
- Zhihu
- Tieba
- Social Engineering Database
- Telegram
Anonymous email senders: http://tool.chacuo.net/mailanonymous and https://emkei.cz/
Temporary email http://www.yopmail.com/
Email pool group http://veryvp.com/
Methods to Prevent Information Gathering#
If website administrators want to prevent their websites from being subjected to preliminary information gathering by hackers, they can modify the webpage's characteristic information:
(1) Modify webpage display information (webpage templates, technical support, keywords, version information, backend login module information, etc.)
(2) Modify webpage path information (/robots, /admin, etc.)
(3) Personalize webpage information: modify path information using pinyin abbreviations or other custom names to hide the site-builder vendor's generic paths, for example changing /admin to /a8min.