Content Discovery TryHackMe

Overview

This room covers techniques for discovering hidden content, directories, and files on web applications. These are essential skills for web application security assessment and penetration testing.

Task 1: What Is Content Discovery?

Content discovery finds hidden or unlinked content on web applications, including directories, files, and endpoints not immediately visible to users.

Content types: files, videos, pictures, backup files, website features, administration portals, staff-only areas.

Three main discovery methods:

Manual: Inspecting and analysing the site by hand
Automated: Using tools and scripts
OSINT: Open-Source Intelligence gathering

Task 2: Manual Discovery - Robots.txt

The robots.txt file tells search engine crawlers which pages they may and may not crawl. Because it lists the areas an owner wants kept out of search results, such as administration portals or customer files, it also gives a tester a ready-made list of locations worth checking.
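
As a rough illustration of reading a robots.txt file for leads, the sketch below pulls the paths out of Disallow rules. The function name and the sample file contents are hypothetical, and it is a minimal sketch that ignores User-agent grouping and wildcards.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return the paths named in Disallow rules, in order, without duplicates."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow":
            value = value.strip()
            # An empty Disallow means "nothing is disallowed", so skip it.
            if value and value not in paths:
                paths.append(value)
    return paths


# Hypothetical robots.txt content, for illustration only.
sample = """\
User-agent: *
Allow: /
Disallow: /staff-portal
Disallow: /backups/   # hypothetical path
"""
print(disallowed_paths(sample))
```

Each returned path is only a candidate; it still has to be requested to see whether it is reachable.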

Task 3: Manual Discovery - Favicon

The favicon can reveal framework information when frameworks leave default icons. OWASP hosts a database of common framework icons at https://wiki.owasp.org/index.php/OWASP_favicon_database.

Practical Example - Favicon Analysis:

1. Download and Hash:

Download: curl https://target.com/images/favicon.ico -o favicon.ico
Hash: md5sum favicon.ico
Compare against OWASP database

2. Framework Identification:

Use hash to identify framework
Research framework vulnerabilities
Look for default paths
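
The hash-and-compare steps above can be sketched in Python as well. This is a minimal sketch under stated assumptions: the lookup table is hypothetical (real hashes would be copied from the OWASP favicon database), and its single entry is simply the MD5 of the bytes b"hello" standing in for a real icon.

```python
import hashlib
from typing import Optional

# Hypothetical lookup table. Real entries would come from the OWASP favicon
# database; this one is the MD5 of b"hello", used purely as a stand-in.
KNOWN_FAVICONS = {
    "5d41402abc4b2a76b9719d911017c592": "ExampleFramework (hypothetical)",
}


def favicon_md5(data: bytes) -> str:
    """Return the hex MD5 digest of the favicon bytes (same value as md5sum)."""
    return hashlib.md5(data).hexdigest()


def identify_framework(data: bytes, known=KNOWN_FAVICONS) -> Optional[str]:
    """Look the favicon hash up in the table; None means no match."""
    return known.get(favicon_md5(data))


print(identify_framework(b"hello"))
```

In practice the bytes would be read from the downloaded file, for example with open("favicon.ico", "rb").read().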

Task 4: Manual Discovery - Sitemap.xml

The sitemap.xml file lists every file the website owner wants indexed by search engines. It can reveal areas that are difficult to navigate or old webpages still working behind the scenes.
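
To show what extracting URLs from a sitemap might look like, here is a short sketch using Python's standard XML parser. It assumes the file uses the standard sitemaps.org namespace; the function name and the sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, for illustration only.
sample_sitemap = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://target.com/</loc></url>
  <url><loc>http://target.com/old-news/archive</loc></url>
</urlset>
"""
print(sitemap_urls(sample_sitemap))
```

Comparing this list with the links visible on the live site is one way to spot pages that are no longer linked but still served.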

Task 5: Manual Discovery - HTTP Headers

HTTP headers can reveal server software, programming languages, and custom information useful for finding vulnerabilities.

Practical Example - Header Analysis:

1. View Headers:

Use: curl http://target.com -v
Look for server and version info
Check for custom headers

2. Analyze Information:

Identify web server and version
Check programming language info
Search for known vulnerabilities

TryHackMe Example:

Web server: NGINX 1.18.0
PHP version: 7.4.3
Custom header: X-FLAG with THM{HEA***_****}
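
The header review can also be scripted. The sketch below filters pasted curl -v output down to the Server header and any X- prefixed headers. The function name, the filtering rule, and the X-Example header are assumptions made for illustration; the server and PHP versions mirror the example above.

```python
def interesting_headers(raw: str) -> dict[str, str]:
    """Keep the Server header and any X- prefixed header from raw header text."""
    found = {}
    for line in raw.splitlines():
        # curl -v prefixes response header lines with "< ".
        line = line.removeprefix("< ").strip()
        name, sep, value = line.partition(":")
        if not sep:
            continue  # status line or blank line, not a header
        key = name.strip().lower()
        if key == "server" or key.startswith("x-"):
            found[name.strip()] = value.strip()
    return found


# Illustrative response headers; X-Example is a hypothetical custom header.
sample_headers = """\
< HTTP/1.1 200 OK
< Server: nginx/1.18.0
< X-Powered-By: PHP/7.4.3
< Content-Type: text/html; charset=UTF-8
< X-Example: demo-value
"""
print(interesting_headers(sample_headers))
```

Headers outside this filter can matter too, so the raw output is still worth reading in full.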

Task 6: Manual Discovery - Framework Stack

Once you've established the framework of a website, either from the favicon example or by looking for clues in the page source such as comments, copyright notices or credits, you can locate the framework's website. Its documentation tells you more about the software, including default paths and features, which can lead to more content to discover.

Practical Example - Framework Analysis:

1. Identify Framework:

Check page source for framework comments
Look for copyright notices or credits
Search for framework-specific file paths
Check for framework documentation links

2. Research Framework:

Visit framework's official website
Look for default paths and configurations
Check for known vulnerabilities
Find administration portal paths

TryHackMe Example:

Framework found in page source comments
Framework website: https://static-labs.tryhackme.cloud/sites/thm-web-framework
Administration portal path discovered
Flag found: THM{CHA***_*******_***********}
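
Looking for framework clues in page source comments can be partly automated. The following is a minimal sketch using Python's built-in HTML parser; the class and function names and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment fed to the parser."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data.strip())


def html_comments(html: str) -> list[str]:
    """Return all comments in the given HTML, in document order."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return parser.comments


# Hypothetical page source, for illustration only.
sample_page = """\
<html>
  <!-- Page generated by ExampleFramework v1.0 (hypothetical) -->
  <body><p>Welcome</p></body>
</html>
"""
print(html_comments(sample_page))
```

Any framework name or version found this way is a lead to research on the framework's own site, as described above.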

Task 7: OSINT - Google Hacking / Dorking

Google hacking uses advanced search operators to pick out specific content. For example, the query site:tryhackme.com admin returns only results from tryhackme.com that contain the word "admin".

Common Google Search Operators:

Filter	Example	Description
site	site:tryhackme.com	Returns results only from the specified website address
inurl	inurl:admin	Returns results that have the specified word in the URL
filetype	filetype:pdf	Returns results which are a particular file extension
intitle	intitle:admin	Returns results that contain the specified word in the title
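
Filters from the table can be combined in one query. The helper below is a small sketch of assembling such a query string and a matching search URL; the function names are hypothetical, and the URL simply uses Google's ordinary q parameter.

```python
from urllib.parse import quote_plus


def build_dork(terms: str = "", **filters: str) -> str:
    """Join filter:value pairs and free-text terms into one query string."""
    parts = [f"{name}:{value}" for name, value in filters.items()]
    if terms:
        parts.append(terms)
    return " ".join(parts)


def search_url(query: str) -> str:
    """Return a Google search URL for the query, URL-encoded."""
    return "https://www.google.com/search?q=" + quote_plus(query)


query = build_dork("admin", site="tryhackme.com")
print(query)
print(search_url(query))
print(build_dork(site="tryhackme.com", filetype="pdf"))
```

Keyword arguments keep their order, so the filters appear in the query in the order they are passed.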

TryHackMe Resources:

More information about Google hacking: https://en.wikipedia.org/wiki/Google_hacking
Advanced search operators for discovering sensitive information
Combining multiple filters for precise content discovery

Task 8: OSINT - Wappalyzer

Wappalyzer (https://www.wappalyzer.com/) is an online tool and browser extension that identifies the technologies a website uses, such as frameworks, Content Management Systems (CMS), and payment processors. It can often report version numbers as well.

Wappalyzer Features:

Framework Detection: Identifies web application frameworks
CMS Detection: Detects content management systems
JavaScript Libraries: Identifies client-side libraries and frameworks
Server Information: Reveals web server and hosting details
Version Numbers: Provides specific version information
Analytics Tools: Shows tracking and analytics services

TryHackMe Resources:

Wappalyzer website: https://www.wappalyzer.com/
Browser extension available for Chrome, Firefox, and other browsers
Online tool for quick technology stack analysis

Task 9: OSINT - Wayback Machine

The Wayback Machine (https://archive.org/web/) is a historical archive of websites that dates back to the late 90s. You can search a domain name, and it will show you all the times the service scraped the web page and saved the contents. This service can help uncover old pages that may still be active on the current website.

Wayback Machine Usage:

Historical Content Discovery: Find old versions of web pages
Vulnerable Versions: Identify older, potentially vulnerable software versions
Removed Functionality: Discover features that were removed but may still be accessible
Backup Endpoints: Find old backup files or endpoints
Change Analysis: Track how the website has evolved over time
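
Beyond the web interface, archived URLs for a domain can be listed programmatically. The sketch below builds a query URL for the Wayback Machine's CDX search endpoint and unpacks a JSON response. It is a sketch under stated assumptions: the function names are hypothetical, the parameter choices are one reasonable combination rather than a required set, and the sample rows are made up to show the response shape (a header row followed by data rows).

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain: str, limit: int = 50) -> str:
    """Build a CDX query listing archived URLs under the given domain."""
    params = {
        "url": f"{domain}/*",        # everything under the domain
        "output": "json",
        "fl": "original,timestamp",  # fields to return
        "collapse": "urlkey",        # one row per distinct URL
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def archived_urls(rows: list[list[str]]) -> list[str]:
    """Extract the 'original' column from a CDX JSON response."""
    if not rows:
        return []
    header, *data = rows  # the first row holds the field names
    index = header.index("original")
    return [row[index] for row in data]


print(cdx_query_url("example.com", limit=10))

# Made-up rows showing the assumed response shape.
sample_rows = [
    ["original", "timestamp"],
    ["http://example.com/old-admin/", "20100101000000"],
]
print(archived_urls(sample_rows))
```

Each archived URL can then be requested against the live site to see whether the old page still responds.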

TryHackMe Resources:

Wayback Machine website: https://archive.org/web/
Search by domain name to view historical snapshots
Can reveal forgotten or deprecated functionality
Useful for finding old backup files and endpoints

Task 10: OSINT - GitHub

GitHub is a hosted version of Git, the version control system, on the internet. Repositories can be public or private, with various access controls. Use GitHub's search feature to look for company names or website names to locate repositories belonging to your target.

GitHub for OSINT:

Source Code Analysis: Examine application source code for vulnerabilities
Configuration Files: Find exposed configuration files with credentials
Backup Files: Discover backup files and documentation
Developer Comments: Find hints and information in code comments
Repository Search: Search for company or website-related repositories
Commit History: Analyze changes and potential security issues

TryHackMe Resources:

GitHub search functionality for finding target-related repositories
Public repositories may contain sensitive information
Search for company names, domain names, or project names
Look for exposed credentials in configuration files

Task 11: OSINT - S3 Buckets

S3 Buckets are a storage service provided by Amazon Web Services (AWS) that lets people save files, and even static website content, in the cloud and serve them over HTTP and HTTPS. The owner can set access permissions to make files public, private, or even writable. Sometimes these permissions are set incorrectly and inadvertently allow access to files that shouldn't be available to the public.

The format of an S3 bucket URL is http(s)://{name}.s3.amazonaws.com, where {name} is decided by the owner, such as tryhackme-assets.s3.amazonaws.com. S3 buckets can be discovered in many ways, such as finding the URLs in the website's page source or in GitHub repositories, or by automating the process. One common automation method is to take the company name and append common terms, such as {name}-assets, {name}-www, {name}-public, {name}-private, etc.

S3 Bucket Discovery:

URL Format: http(s)://{name}.s3.amazonaws.com
Common Naming Patterns: {name}-assets, {name}-www, {name}-public, {name}-private
Discovery Methods: Page source analysis, GitHub repositories, automation
Access Permissions: Public, private, or writable configurations
Exposed Content: Configuration files, backup data, application source code
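
The naming-pattern approach is easy to sketch in code. The example below only generates candidate bucket URLs from a company name and the suffixes listed above; the function name is hypothetical, and checking whether a candidate actually exists is left out.

```python
# Suffixes taken from the common naming patterns listed above.
SUFFIXES = ("assets", "www", "public", "private")


def candidate_bucket_urls(name: str, suffixes=SUFFIXES, scheme: str = "https") -> list[str]:
    """Return candidate S3 bucket URLs of the form {name}-{suffix}.s3.amazonaws.com."""
    return [f"{scheme}://{name}-{suffix}.s3.amazonaws.com" for suffix in suffixes]


for url in candidate_bucket_urls("tryhackme"):
    print(url)
```

The suffix list is a starting point and can be extended with other terms that fit the target.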

TryHackMe Resources:

S3 buckets can be discovered through various methods
Misconfigured permissions can expose sensitive data
Common naming patterns help in discovery
Always check for exposed configuration files and credentials

Task 12: Automated Discovery

Automated discovery uses tools to discover content rather than doing it manually. The process involves sending hundreds, thousands or even millions of requests to a web server to check whether files or directories exist.

What are Wordlists?

Wordlists are text files containing commonly used words for different use cases. For content discovery, we need lists containing common directory and file names. An excellent resource is https://github.com/danielmiessler/SecLists.

Automation Tools:

We'll cover three tools that come preinstalled on the TryHackMe AttackBox: ffuf, dirb and gobuster.

Using ffuf:

ffuf -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt -u http://MACHINE_IP/FUZZ

Using dirb:

dirb http://MACHINE_IP/ /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt

Using Gobuster:

gobuster dir --url http://MACHINE_IP/ -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
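
To make clear what these tools do under the hood, here is a simplified sketch of wordlist-based discovery in Python. It is not how ffuf, dirb or gobuster are implemented; the function names are hypothetical, the status check is passed in as a parameter so the logic can be tried without a network, and the default check follows redirects and does not handle connection failures.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str) -> int:
    """Return the HTTP status code for a URL (needs network access)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions


def discover(
    base_url: str,
    words: Iterable[str],
    status_of: Callable[[str], int] = http_status,
    ignore: tuple = (404,),
) -> list[tuple[str, int]]:
    """Request base_url/word for each word and keep non-ignored statuses."""
    hits = []
    for word in words:
        word = word.strip()
        if not word or word.startswith("#"):
            continue  # skip blank lines and wordlist comments
        url = f"{base_url.rstrip('/')}/{word}"
        code = status_of(url)
        if code not in ignore:
            hits.append((url, code))
    return hits


# Offline demonstration with a stand-in status function (hypothetical paths).
def fake_status(url: str) -> int:
    return 200 if url.endswith("/admin") else 404


print(discover("http://MACHINE_IP/", ["admin", "", "missing"], status_of=fake_status))
```

Against a real target, the words would be read from a file such as common.txt, and the default http_status would be used instead of the stand-in.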

TryHackMe Resources:

SecLists wordlist repository: https://github.com/danielmiessler/SecLists
ffuf: Fast web fuzzer for content discovery
dirb: Directory brute-forcing tool
gobuster: Multi-purpose tool for directory and DNS enumeration
Wordlists contain common directory and file names for brute-forcing

PT1 Exam Relevance

This room covers essential content discovery skills for PT1 certification:

Reconnaissance Techniques: Fundamental for web application assessment
Information Gathering: Critical for comprehensive security testing
OSINT Skills: Essential for modern penetration testing
Manual Discovery: Complements automated tools
Technology Identification: Helps understand attack surface

Essential Content Discovery Checklist

CRITICAL: Always check these key areas during content discovery:

robots.txt: Check for restricted areas and hidden directories
sitemap.xml: Find indexed pages and old webpages
HTTP Headers: Use curl http://target.com -v to reveal server info
Favicon: Download and hash to identify frameworks
Page Source: Look for comments, framework clues, and hardcoded paths
Wappalyzer: Identify technologies and versions
Wayback Machine: Check for historical content and old endpoints
GitHub: Search for exposed source code and credentials
S3 Buckets: Check for misconfigured cloud storage
Google Dorking: Use advanced search operators for content discovery

Key Takeaways

Content discovery is essential for comprehensive web application testing
Manual techniques complement automated tools
OSINT provides valuable intelligence before active testing
Understanding technology stack helps identify vulnerabilities
Historical data can reveal forgotten vulnerabilities
Always check for misconfigurations in cloud services