Burnchi

Information Collection and Summary

Why Collect Information?#

The purpose of intelligence gathering is to obtain accurate information about the penetration target, to understand how the target organization operates, and to determine the best attack route. All of this should be done quietly, without letting the other party detect your presence or deduce your intentions. Information gathering is one of the most important stages of penetration testing: before testing can begin, you need to collect basic information about the target host. ==The more information you obtain, the higher the probability of a successful penetration test==.

Classification of Information Gathering#

  • Passive Information Gathering: ==collecting information about the target through third-party services== such as Google search, Shodan, and other aggregation tools, without touching the target directly. The goal is to gather as much target-related information as possible without leaving a trace.
  • Active Information Gathering: ==directly scanning the target host or website==. Active methods obtain more information, but the target system may log your operations.

What Information Should Be Collected?#

| IP Resources | Server Information | Website Information | Human Resources |
| --- | --- | --- | --- |
| Real IP | Operating system type and version | CMS | Domain owner, registrar |
| Side site information | Open ports | WAF | Phone number |
| C-class hosts | | Web middleware | Email |
| | | Development language | Various privacy |
| | | Database | |
| | | API, specific files | |

Information Gathering Methods#

1. Real IP#

01. Determine if it is a Real IP#

Before talking about real IPs, let's briefly introduce CDN technology (==Content Delivery Network==). To keep the network stable and transmission fast, website providers deploy node servers at different locations on the network and use a CDN to route each request to the optimal node server.

  • Online Website Query

Website tools: http://ping.chinaz.com/
Aizhan: https://ping.aizhan.com/

==If there are multiple different response IPs, it indicates that there may be a CDN==.

  • nslookup

If the domain resolves to multiple IP addresses, it is likely using a CDN.
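As a quick sketch of that check: count the distinct answer addresses in the resolver output. The sample below is hardcoded for illustration (real output would come from `nslookup example.com`, and exact formatting varies by platform).

```shell
# Count distinct answer IPs in nslookup-style output; more than one
# usually suggests a CDN. sample_output is hardcoded for illustration.
sample_output='Name: www.example-cdn.com
Address: 104.16.1.1
Name: www.example-cdn.com
Address: 104.16.2.2'
ip_count=$(printf '%s\n' "$sample_output" | awk '/^Address/ {print $2}' | sort -u | wc -l | tr -d ' ')
if [ "$ip_count" -gt 1 ]; then
  echo "multiple IPs: possible CDN"
else
  echo "single IP: CDN less likely"
fi
```

Note this is only a hint, not proof: round-robin DNS without a CDN can also return multiple addresses.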

02. How to Find the Real IP (Bypassing CDN)#

1. Look for Subdomain IPs#

Subdomains may be on the same server or in the same C-class network as the main site. By querying the IP information of subdomains, you can assist in determining the real IP information of the main site.

See below ==4. Subdomain Information Collection==.

2. Check Historical DNS Resolution Information#

Check the historical records of IP-to-domain bindings. There may be records from ==before the CDN was introduced==; then ==compare them against the current CDN resolution IPs==: addresses not in that set ==may be the real IP, without CDN acceleration==.

  • viewdns.info DNS historical record website, which records changes over the years.
Syntax: domain:baihe.com type:A

Just enter the website domain in the search field and press Enter. The "Historical Data" can then be found in the left menu.

  • Cloudflare's Advice

==A, AAAA, CNAME, or MX records pointing to your origin will expose your original IP.==

So you can check the DNS resolution records corresponding to the domain.

3. Use Foreign Hosts for Direct Detection#

If you don't have a host abroad, you can use public multi-location ping services instead. These services have detection nodes overseas, and ==the ICMP responses returned from the foreign nodes== can be used to determine the real IP (many sites deploy CDN nodes only domestically, so a foreign node may reach the origin directly).

  • Foreign node ping addresses

https://ping.chinaz.com/

http://www.webkaka.com/Ping.aspx

4. Check the Email Server IP from Emails Received#
  • RSS email subscriptions: many websites run sendmail and will send us emails; viewing the raw email source then reveals the server's real IP.
  • If the target system has a mailing function, it usually sends emails during user registration/password recovery, etc. By checking the original email sent by the system, you can view the sender's IP address.


  • DNS's MX records (see point 2 above).
5. Certificate Query#

https://crt.sh/

The principle: send a Client Hello to the IP's port 443. The server replies with a Server Hello carrying its SSL certificate, and the ==Common Name in the SSL certificate contains domain information==; this tells you which domain resolves to the IP. So, more precisely, it is the IP's port 443 that may expose the domain.
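A minimal offline demonstration of the principle: generate a throwaway self-signed certificate for a made-up domain, then read its Common Name the same way you would read it from a live server's certificate. Against a real host you would instead run `echo | openssl s_client -connect IP:443 | openssl x509 -noout -subject`; the domain `example-target.com` here is purely hypothetical.

```shell
# Generate a throwaway self-signed cert with a hypothetical CN (offline demo)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/demo_key.pem -out /tmp/demo_cert.pem \
  -subj "/CN=example-target.com" 2>/dev/null

# Extract the subject CN, as you would from a certificate fetched off port 443
openssl x509 -in /tmp/demo_cert.pem -noout -subject
```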

https://search.censys.io/# Check historical certificates.

Syntax:

parsed.names: 4399.com and tags.raw: trusted
To show only valid certificates, add the parameter: tags.raw: trusted

Censys will show you all standard certificates that meet the above search criteria. The above certificates were found during scanning.

Just click on any certificate.

6. Use zmap to Capture Target IP Segment 80 Banner Information#

Randomly scan 10,000 IPs on port 80.

zmap -B 10M -p 80 -n 10000 -o results.csv

Loop over the discovered IPs and use curl to record each banner, tagging every response with its source IP so it can be matched later:

for i in $(zmap -B 10M -p 80 -n 10000); do echo "== $i" >> out1; curl -s -I "$i" >> out1; done

Then look for an IP whose ==banner matches== the target domain's; that IP is likely the real IP.
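A sketch of the matching step, assuming each banner line has been tagged with its source IP (the sample data below is hardcoded for illustration; in practice it comes from the zmap/curl loop):

```shell
# Find IPs whose Server banner matches the target's (sample data, not real scans)
target_banner='Server: nginx/1.18.0'
printf '1.2.3.4 Server: nginx/1.18.0\n5.6.7.8 Server: Apache/2.4.41\n' > /tmp/banners.txt
match_ip=$(grep -F "$target_banner" /tmp/banners.txt | awk '{print $1}')
echo "$match_ip"
```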

7. Domain Tweaking#

In the past, a common habit when deploying a CDN was to put only the www domain behind it while leaving the naked (apex) domain uncached, to make the site easier to maintain without waiting on the CDN cache. So try removing the www from the target domain and ping it to see whether the IP changes.

8. Social Engineering#

If you have obtained the target website administrator's CDN account, you can find the website's real IP in the CDN configuration.

2. Side Site Information Collection#

Side sites are different websites on the ==same server as the attack target==. When the target itself has no vulnerabilities, you can look for vulnerabilities in a side site, compromise it, and then escalate privileges to gain the highest permissions on the server.

  • nmap port scanning
nmap -sV -p- real_ip -v -oN xxx.txt
  • Online query websites

https://www.webscan.cc/

https://stool.chinaz.com/same

3. C-Class Information Collection#

C-class hosts refer to servers that are ==in the same C-class network as the target server==. The live hosts in the target's C-class are important information for information gathering. Many internal servers of units and enterprises may be in the same C-class network.

  • nmap
nmap -sn real_IP/24 -v -oN xxx.txt

You can also add -n (no DNS resolution): it tells Nmap never to perform reverse DNS resolution on the active IP addresses it discovers. Since DNS is generally slow, this speeds things up.

  • Use Google, syntax: site:125.125.125.*
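A small helper for deriving the /24 (C-class) network from a discovered real IP before scanning it (192.0.2.57 is a placeholder address):

```shell
# Derive the C-class (/24) network from a host IP
ip=192.0.2.57
cnet=$(echo "$ip" | awk -F. '{print $1"."$2"."$3".0/24"}')
echo "$cnet"    # then feed this to: nmap -sn "$cnet" -v -oN xxx.txt
```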

4. Subdomain Information Collection#

01. Subdomain Bruteforce Tools#

OneForAll, a Python tool, requires Python 3.6.0 or higher. Run with default parameters, it writes its results to the results directory. Install dependencies before use: pip install -r requirements.txt.

python3 oneforall.py --target example.com run
python3 oneforall.py --targets ./example.txt run
  • JSFinder: see below ==9.06==.
  • ESD (Download from GitHub, but I encountered errors using it).
# Scan a single domain
esd -d qq.com
  • subfinder (Download from GitHub, requires Go language).
subfinder -d hackerone.com

Used with httpx, it can find running HTTP servers (httpx is written in Go).

echo 4399.com | subfinder -silent | httpx -ip > subdomain_list
httpx --silent only outputs the domain.

02. Online Query Websites#

  • ==Search Engines to Discover Subdomains==
Baidu Search Engine
site:baidu.com

Google Search Engine
site:baidu.com

https://fofa.info/
https://www.shodan.io/
https://x.threatbook.com/v5/mapping

https://dnsdumpster.com/

https://www.dnsdb.io/zh-cn/ Useful but requires membership for extensive use

Input baidu.com type:A.


5. Determine Operating System Type and Version#

  • nmap
nmap -O 192.168.88.21
  • Check whether the website URL is case-sensitive (case-insensitive suggests Windows; case-sensitive suggests Linux).

  • Windows TTL value is generally 128 (or >100), while Linux is 64.
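A sketch of the TTL heuristic (the ping line below is hardcoded for illustration; against a live host you would parse `ping -c 1 target`). The thresholds follow the rule of thumb above and can be skewed by intermediate hops:

```shell
# Guess the OS from the TTL in a ping reply (sample line, not a live ping)
ping_line='64 bytes from 192.0.2.1: icmp_seq=1 ttl=128 time=3.10 ms'
ttl=$(printf '%s\n' "$ping_line" | grep -o 'ttl=[0-9]*' | cut -d= -f2)
if [ "$ttl" -gt 100 ]; then
  guess="likely Windows"
else
  guess="likely Linux/Unix"
fi
echo "$guess"
```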

6. Website Owner Information Collection#

Helpful for dictionary creation.

01. whois#

Whois (pronounced "who is"; not an abbreviation) is a protocol for querying the IP and ownership information of a domain. Put simply, whois is a database for checking whether a domain has been registered and for retrieving a registered domain's details (such as the ==domain owner and domain registrar==). Early whois queries were mostly run from the command line; today, web-based tools simplify online lookups and can query multiple databases at once. These web tools still rely on the whois protocol to send queries to servers, and command-line clients remain widely used by system administrators. Whois typically runs over TCP on port 43.

==The WHOIS information for each domain or IP is maintained by the corresponding management organization==. For example, the WHOIS information for .com domains is managed by the .com domain operator VeriSign, while the national top-level domain .cn in China is managed by CNNIC.


02. Social Engineering#

Assuming we have obtained information through the target's colleagues, such as the target's real name, contact information, work hours, etc. *A skilled social engineer will organize, classify, and filter the information to construct a carefully prepared trap, allowing the target to walk into it.*

03. Personal Information Retained by Official Websites#

Generally, companies will place official contact information on their official websites, which can be used to collect email and phone information.

04. Recruitment Information Collection#

Recruitment information on job websites contains a lot of personnel-related information. Recruitment information involves electronic mail, phone numbers, and other related information of the recruited personnel, while job seekers' resumes contain very detailed personal information such as names, phone numbers, emails, and work experience. If there are security vulnerabilities on the recruitment website, job seekers' resumes may be leaked.

05. ICP Filing Information#

Reveals company information and the filing review date.

https://icp.chinaz.com/

https://beian.miit.gov.cn/#/Integrated/index

06. Exposed Locations#

Same effect as the certificate query method described earlier under Real IP (==5. Certificate Query==).

  • View individual certificate information.

https://crt.sh/

https://search.censys.io/#

07. Check Company Information#

https://www.qcc.com/

https://www.tianyancha.com/

https://tool.chinaz.com/

08. Obtain Email Information#

https://www.skymem.info/

09. Others#

(1) Look for usernames directly on the web (as they generally have emails, you can get usernames based on company names or numbers to generate corresponding dictionaries).
(2) Use Google syntax to search for xlsx, etc., or directly search for this company-related information, which may reveal usernames.
(3) Check GitHub for this company to see if there are any leaks.
(4) Look for interviewers on job websites, as they may leak phone numbers and usernames, and check usernames based on phone numbers.
(5) Search for the company's organizational chart and note down any leaders.
(6) Use public accounts, Weibo, and other social media to search for company information.
(7) Use Baidu Images. This depends on luck: web searches sometimes return too many results, and browsing image results can surface usernames faster. (I thought of this during a previous attack-defense exercise when I needed to read a number from an image, but it was too blurred to make out.)
(8) Look for commonly used username dictionaries for collection.

7. Identify CMS#

A Content Management System (CMS) is a system for managing website content. CMSs offer many excellent ==ready-made templates==, ==which speed up website development and reduce its cost==. A CMS is not limited to text: it can also handle images, Flash animations, audio and video streams, graphics, and even email archives. "CMS" is really a broad term, covering everything from simple blog and news-publishing programs to comprehensive website management systems.

01. Manual Identification#

  • ==The footer may expose the CMS==
powered by ...
  • ==robots.txt file==

The specific paths it discloses can identify the CMS.

  • ==Response Header Information==

For example, the cookie names may be CMS-specific.

  • ==Website Backend==

The website's backend login interface also has characteristic codes of the CMS.

  • Determine based on URL routing, such as wp-admin.

02. Fingerprint Recognition Tools#

The main development idea: establish a connection and send a request, fetch the page content, match keywords with regular expressions, and identify the CMS type.
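A minimal sketch of that idea. The sample HTML is hardcoded; in practice it would come from `curl -s http://target/`, and real tools match far more fingerprints than a generator tag:

```shell
# Match CMS keywords in fetched page content (sample HTML, not a live fetch)
html='<meta name="generator" content="WordPress 6.2" />'
case "$html" in
  *WordPress*) cms="WordPress" ;;
  *Joomla*)    cms="Joomla" ;;
  *Drupal*)    cms="Drupal" ;;
  *)           cms="unknown" ;;
esac
echo "CMS: $cms"
```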

  • ==Chrome Extension -- Wappalyzer==
  • Common tools include CMSeek.


03. Online CMS Recognition Websites#

http://whatweb.bugscaner.com/look/

8. Identify Web Middleware#

  • Response headers.
  • Determine based on error messages.
  • Determine based on default pages.

9. Internet Asset Collection#

Includes historical vulnerability information, GitHub source code leaks, SVN source code information, leaked cloud disk file information, etc.

01. Historical Vulnerability Information#

Google search for relevant software vulnerabilities.

02. GitHub Source Code Information Leaks#

GitHub is a hosting platform for open-source and private software projects, and many people like to upload their code to the platform. ==Attackers can search using keywords== to find ==sensitive information about the target site==, and even download the website source code.

When developers use git for version control, after initializing a repository in a directory, a hidden folder named .git is created in that directory, which contains all versions and a series of information about the repository. ==If the server places the .git folder in the web directory==, it may allow attackers to obtain all source code of the application using the information inside the .git folder.
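A quick offline illustration of the check: a readable `.git/HEAD` file begins with `ref:`. Against a web target, the equivalent probe would be `curl -s http://target/.git/HEAD` (URL hypothetical); a response with that signature usually means the repository can be pulled.

```shell
# Simulate an exposed .git directory locally and test the HEAD file signature
mkdir -p /tmp/demo_site/.git
echo 'ref: refs/heads/main' > /tmp/demo_site/.git/HEAD
if head -n 1 /tmp/demo_site/.git/HEAD | grep -q '^ref: '; then
  echo ".git HEAD signature found: repository likely downloadable"
fi
```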

  • GitHub syntax search
`in:name vue`: matches repositories with "vue" in the name.
`in:description vue`: matches repositories with "vue" in the name or description.
`in:readme vue`: matches repositories that mention "vue" in the README.
`repo:owner/name`: matches a specific repository, e.g. `repo:biaochenxuying/blog` (user biaochenxuying's blog project).

For more details on search syntax, see

https://github.com/FrontEndGitHub/FrontEndGitHub/issues/4

  • GitHack, to pull source code.

A `.git` folder disclosure exploit

03. Backup Site Compressed Packages#

Attempt to obtain through directory scanning.

04. SVN#

You can use the .svn/entries file to obtain server source code, SVN server account passwords, and other information. A more serious issue is that the .svn directory generated by SVN also contains source code file copies ending with .svn-base (for lower versions of SVN, the specific path is the text-base directory, while for higher versions, it is the pristine directory). If the server does not parse such suffixes, hackers can directly obtain the source code files.

Details

https://cloud.tencent.com/developer/article/1376492

  • Source code restoration tool

05. DNS Information Leaks#

A. MX Record Leaks

https://dnsdumpster.com/

https://www.robtex.com/

https://mxtoolbox.com/

06. API Leaks#

JSFinder Tampermonkey Script


07. Other Sensitive Files#

First check which CMS is being used, and then scan according to that CMS's directory structure.

If no CMS is used, use conventional sensitive file name dictionaries for scanning, such as:

  • robots.txt
  • crossdomain.xml
  • sitemap.xml
  • xx.tar.gz
  • xx.bak
  • phpinfo

Lingfengyun Search

https://www.lingfengyun.com/

Xiaobaipan Search

Dali Pan Search

Xiaobudian Search (Weipan)

Baidu Cloud Disk Crawling Open Source Tool

Google search for relevant middleware information leaks.

10. WAF Identification#

WAF Functions

A WAF can often be identified from the characteristics of its block page:

https://blog.csdn.net/weixin_46676743/article/details/112245605

Tools

  • WAFW00f

Or manually send malformed URIs, SQL statements, and XSS payloads to see whether they trigger WAF alerts.

  • nmap -p 80 --script http-waf-detect.nse 4399.com
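A sketch of interpreting such a manual probe. The status code is hardcoded here; in practice it would come from something like `code=$(curl -s -o /dev/null -w '%{http_code}' "http://target/?q=<script>")` (URL hypothetical), and which codes indicate a block varies by WAF:

```shell
# Interpret the HTTP status code of a malicious-looking probe (sample value)
code=406
case "$code" in
  403|406|501) verdict="request blocked: WAF likely" ;;
  200)         verdict="served normally: no obvious WAF" ;;
  *)           verdict="inconclusive" ;;
esac
echo "$verdict"
```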

11. Port Scanning#

How to scan all ports:

nmap is thorough but slow.

nmap -sV -Pn -p- 1.1.1.1 -oX result.xml

masscan is fast but sometimes inaccurate.

masscan --open --banners -p- 1.1.1.1 --rate 1000 -oX result.xml

Common Port Vulnerability Information Table

| Port Number | Service | Attack Methods |
| --- | --- | --- |
| 21/22/69 | ftp/tftp | Brute force, sniffing, overflow, backdoor |
| 22 | ssh | Brute force, 28-backspace vulnerability |
| 23 | telnet | Brute force, sniffing |
| 25 | smtp | Email forgery, brute force |
| 53 | dns | DNS zone transfer, DNS hijacking, DNS cache poisoning, DNS spoofing, DNS tunneling |
| 67/68 | dhcp | Hijacking, spoofing |
| 110 | pop3 | Brute force |
| 139 | samba | Brute force, unauthorized access, remote code execution |
| 143 | imap | Brute force |
| 161 | snmp | Brute force |
| 389 | ldap | Injection, unauthorized access |
| 512/513/514 | linux r services | Direct rlogin |
| 873 | rsync | Unauthorized access |
| 1080 | socks | Brute force, internal penetration |
| 1352 | lotus | Brute force, weak passwords, information leakage (source code) |
| 1433 | mssql | Brute force, injection |
| 1521 | oracle | Brute force, injection, TNS remote poisoning |
| 2049 | nfs | Misconfiguration |
| 2181 | zookeeper | Unauthorized access |
| 3306 | mysql | Brute force, injection, denial of service |
| 3389 | rdp | Brute force, shift backdoor |
| 4848 | glassfish | Brute force, console weak passwords, authentication bypass |
| 5000 | sybase/db2 | Brute force, injection |
| 5432 | postgresql | Brute force, weak passwords, injection, buffer overflow |
| 5632 | pcanywhere | Denial of service, code execution |
| 6379 | redis | Unauthorized access, brute force, weak passwords |
| 7001 | weblogic | Deserialization, console weak passwords, webshell deployment via console |
| 8069 | zabbix | Remote command execution |
| 8080-8090 | web | Common web attacks, brute force, middleware vulnerabilities, CMS version vulnerabilities |
| 9090 | websphere | Brute force, console weak passwords, deserialization |
| 9200/9300 | elasticsearch | Remote code execution |
| 11211 | memcached | Unauthorized access |
| 27017 | mongodb | Brute force, unauthorized access |

12. Directory Scanning#

13. APP Information Collection#

14. Other Information Collection Channels#

  • Zhihu
  • Tieba
  • Social Engineering Database
  • Telegram

  • Anonymous email senders: http://tool.chacuo.net/mailanonymous and https://emkei.cz/
  • Temporary email: http://www.yopmail.com/
  • Email pool: http://veryvp.com/

Methods to Prevent Information Gathering#

If website administrators want to prevent their sites from being profiled during a hacker's preliminary reconnaissance, they can modify the site's identifying characteristics:
(1) Modify displayed page information (page templates, "technical support" text, keywords, version information, backend login module information, etc.).
(2) Modify well-known page paths (/robots, /admin, etc.).
(3) Personalize path names: use pinyin abbreviations or other custom names to hide the site builder's generic paths, for example changing /admin to /a8min.
