Overview
This room covers techniques for discovering hidden content, directories, and files on web applications. These are essential skills for web application security assessment and penetration testing.
Task 1: What Is Content Discovery?
Content discovery finds hidden or unlinked content on web applications, including directories, files, and endpoints not immediately visible to users.
Content types: files, videos, pictures, backup files, website features, administration portals, staff-only areas.
Three main discovery methods:
- Manually: Manual inspection and analysis
- Automated: Using tools and scripts
- OSINT: Open-Source Intelligence gathering
Task 2: Manual Discovery - Robots.txt
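The robots.txt file tells search engine crawlers which pages they may crawl and which to avoid. Because it is publicly readable and purely advisory, its entries can point straight at restricted areas such as administration portals or customer files that the owner doesn't want discovered.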
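A minimal Python sketch of pulling the Disallow entries out of a robots.txt file. The sample content and its paths are made up for illustration; in practice the text would come from the target's /robots.txt.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return the unique Disallow paths from robots.txt content, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow":
            path = value.strip()
            if path and path not in paths:
                paths.append(path)
    return paths


# Hypothetical robots.txt content used only for illustration.
sample = """\
User-agent: *
Allow: /
Disallow: /staff-portal   # hypothetical path
Disallow: /backups/
"""

print(disallowed_paths(sample))
```

Each returned path is a candidate to visit manually in the browser.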
Task 3: Manual Discovery - Favicon
The favicon can reveal framework information when frameworks leave default icons. OWASP hosts a database of common framework icons at https://wiki.owasp.org/index.php/OWASP_favicon_database.
Practical Example - Favicon Analysis:
1. Download and Hash:
- Download:
curl -s https://target.com/images/favicon.ico -o favicon.ico
- Hash:
md5sum favicon.ico
- Compare against OWASP database
2. Framework Identification:
- Use hash to identify framework
- Research framework vulnerabilities
- Look for default paths
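The hash-and-compare step above could be sketched in Python as follows. The lookup table is hypothetical (one made-up entry); real hashes would be copied from the OWASP favicon database.

```python
import hashlib


def favicon_md5(data: bytes) -> str:
    """Return the MD5 hex digest of raw favicon bytes (same value md5sum prints)."""
    return hashlib.md5(data).hexdigest()


# Hypothetical lookup table: a single invented entry for illustration only.
KNOWN_FAVICONS = {
    favicon_md5(b"example-icon-bytes"): "ExampleFramework (made up)",
}


def identify_framework(data: bytes, known=KNOWN_FAVICONS):
    """Return the framework name for a known favicon hash, or None."""
    return known.get(favicon_md5(data))


print(identify_framework(b"example-icon-bytes"))
```

With a real icon, the bytes could be read from a downloaded file, for example open("favicon.ico", "rb").read().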
Task 4: Manual Discovery - Sitemap.xml
The sitemap.xml file lists every file the website owner wants indexed by search engines. It can reveal areas that are difficult to navigate or old webpages still working behind the scenes.
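A short Python sketch for listing the URLs in a sitemap, assuming the standard sitemaps.org layout where each page sits in a loc element. The sample XML and its URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(xml_text: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    urls = []
    for elem in root.iter():
        # Namespaced tags look like "{namespace}loc"; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "loc" and elem.text:
            urls.append(elem.text.strip())
    return urls


# Hypothetical sitemap content used only for illustration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://target.com/</loc></url>
  <url><loc>http://target.com/old-page</loc></url>
</urlset>"""

print(sitemap_urls(sample))
```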
Task 5: Manual Discovery - HTTP Headers
HTTP headers can reveal server software, programming languages, and custom information useful for finding vulnerabilities.
Practical Example - Header Analysis:
1. View Headers:
- Use:
curl http://target.com -v
- Look for server and version info
- Check for custom headers
2. Analyze Information:
- Identify web server and version
- Check programming language info
- Search for known vulnerabilities
TryHackMe Example:
- Web server: NGINX 1.18.0
- PHP version: 7.4.3
- Custom header: X-FLAG with THM{HEADER_FLAG}
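As a rough sketch, the header review can be scripted by parsing the response lines that curl -v prints (prefixed with "< ") and keeping the ones that tend to leak information. The sample text below is illustrative: the values mirror the example above, but the exact header names and formatting are assumptions.

```python
def parse_headers(raw: str) -> dict[str, str]:
    """Parse 'Name: value' response lines, as printed by curl -v, into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.removeprefix("< ").strip()
        name, sep, value = line.partition(":")
        # Skip the status line, curl's "*" notes, and ">" request lines.
        if sep and name and " " not in name:
            headers[name.lower()] = value.strip()
    return headers


def interesting_headers(headers: dict[str, str]) -> dict[str, str]:
    """Keep the Server header and any X-* header (custom or software info)."""
    return {k: v for k, v in headers.items() if k == "server" or k.startswith("x-")}


# Illustrative curl -v output; header names and formatting are assumed.
sample = """\
* Connected to target.com port 80
> Host: target.com
< HTTP/1.1 200 OK
< Server: nginx/1.18.0
< X-Powered-By: PHP/7.4.3
< X-FLAG: THM{HEADER_FLAG}
< Content-Type: text/html
"""

print(interesting_headers(parse_headers(sample)))
```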
Task 6: Manual Discovery - Framework Stack
Once you've established a website's framework, either from the favicon or from clues in the page source such as comments, copyright notices, or credits, you can locate the framework's own website. Its documentation describes the software in more detail and may point to further content to discover, such as default paths.
Practical Example - Framework Analysis:
1. Identify Framework:
- Check page source for framework comments
- Look for copyright notices or credits
- Search for framework-specific file paths
- Check for framework documentation links
2. Research Framework:
- Visit framework's official website
- Look for default paths and configurations
- Check for known vulnerabilities
- Find administration portal paths
TryHackMe Example:
- Framework found in page source comments
- Framework website: https://static-labs.tryhackme.cloud/sites/thm-web-framework
- Administration portal path discovered
- Flag found: THM{CHANGE_DEFAULT_CREDENTIALS}
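Checking page source for comments can be sketched with Python's standard html.parser module. The sample page and the framework name in it are hypothetical.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment fed to the parser."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data.strip())


def html_comments(page_source: str) -> list[str]:
    """Return all HTML comments found in the page source, in order."""
    collector = CommentCollector()
    collector.feed(page_source)
    collector.close()
    return collector.comments


# Hypothetical page source used only for illustration.
sample = "<html><!-- Built with ExampleFramework v1.0 (hypothetical) --><body>Hi</body></html>"

print(html_comments(sample))
```

Any framework name or URL found this way is the starting point for the research steps above.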
Task 7: OSINT - Google Hacking / Dorking
Google hacking (or dorking) uses advanced search operators to filter results down to specific content. For example, the query site:tryhackme.com admin returns only results from tryhackme.com that contain the word "admin".
Common Google Search Operators:
| Filter | Example | Description |
| --- | --- | --- |
| site | site:tryhackme.com | Returns results only from the specified website address |
| inurl | inurl:admin | Returns results that have the specified word in the URL |
| filetype | filetype:pdf | Returns results which are a particular file extension |
| intitle | intitle:admin | Returns results that contain the specified word in the title |
TryHackMe Resources:
- More information about Google hacking: https://en.wikipedia.org/wiki/Google_hacking
- Advanced search operators for discovering sensitive information
- Combining multiple filters for precise content discovery
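Combining filters is just string assembly, so a small Python helper can build the queries. This is a sketch: the function names are made up here, and the search URL assumes Google's usual q query parameter.

```python
from urllib.parse import urlencode


def build_dork(terms=(), **operators) -> str:
    """Join operator:value pairs and free-text terms into one search query."""
    parts = [f"{name}:{value}" for name, value in operators.items()]
    parts.extend(terms)
    return " ".join(parts)


def search_url(query: str) -> str:
    """Return a Google search URL for the query (assumes the 'q' parameter)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_dork(["admin"], site="tryhackme.com")
print(query)
print(search_url(query))
print(build_dork(site="tryhackme.com", filetype="pdf"))
```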
Task 8: OSINT - Wappalyzer
Wappalyzer (https://www.wappalyzer.com/) is an online tool and browser extension that identifies the technologies a website uses, such as frameworks, Content Management Systems (CMS), and payment processors, and it can often report version numbers too.
Wappalyzer Features:
- Framework Detection: Identifies web application frameworks
- CMS Detection: Detects content management systems
- JavaScript Libraries: Identifies client-side libraries and frameworks
- Server Information: Reveals web server and hosting details
- Version Numbers: Provides specific version information
- Analytics Tools: Shows tracking and analytics services
TryHackMe Resources:
- Wappalyzer website: https://www.wappalyzer.com/
- Browser extension available for Chrome, Firefox, and other browsers
- Online tool for quick technology stack analysis
Task 9: OSINT - Wayback Machine
The Wayback Machine (https://archive.org/web/) is a historical archive of websites that dates back to the late 90s. You can search a domain name, and it will show you all the times the service scraped the web page and saved the contents. This service can help uncover old pages that may still be active on the current website.
Wayback Machine Usage:
- Historical Content Discovery: Find old versions of web pages
- Vulnerable Versions: Identify older, potentially vulnerable software versions
- Removed Functionality: Discover features that were removed but may still be accessible
- Backup Endpoints: Find old backup files or endpoints
- Change Analysis: Track how the website has evolved over time
TryHackMe Resources:
- Wayback Machine website: https://archive.org/web/
- Search by domain name to view historical snapshots
- Can reveal forgotten or deprecated functionality
- Useful for finding old backup files and endpoints
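Snapshot lookups can also be scripted. The sketch below assumes the Wayback Machine's CDX server endpoint and its url, output, fl, and collapse parameters behave as commonly documented; treat those as assumptions to verify. The sample response is invented, and no network request is made here.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain: str) -> str:
    """Build a CDX query listing each archived URL under the domain once."""
    params = {
        "url": f"{domain}/*",   # everything under the domain
        "output": "json",       # rows as JSON arrays, header row first
        "fl": "original",       # only return the original URL field
        "collapse": "urlkey",   # one row per distinct URL
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def original_urls(cdx_json: str) -> list[str]:
    """Extract unique original URLs from a JSON CDX response body."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, *records = rows
    index = header.index("original")
    urls = []
    for record in records:
        if record[index] not in urls:
            urls.append(record[index])
    return urls


# Invented response body used only for illustration.
sample_response = '[["original"], ["http://target.com/"], ["http://target.com/old-admin"]]'

print(cdx_query_url("target.com"))
print(original_urls(sample_response))
```

Each archived URL can then be requested on the live site to see whether the old page still responds.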
Task 10: OSINT - GitHub
GitHub is a hosted Git service on the internet. Repositories can be public or private, with various access controls. Use GitHub's search feature to look for company names or website names to locate repositories belonging to your target.
GitHub for OSINT:
- Source Code Analysis: Examine application source code for vulnerabilities
- Configuration Files: Find exposed configuration files with credentials
- Backup Files: Discover backup files and documentation
- Developer Comments: Find hints and information in code comments
- Repository Search: Search for company or website-related repositories
- Commit History: Analyze changes and potential security issues
TryHackMe Resources:
- GitHub search functionality for finding target-related repositories
- Public repositories may contain sensitive information
- Search for company names, domain names, or project names
- Look for exposed credentials in configuration files
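Once a repository is cloned, a crude first pass for exposed credentials is a keyword scan over its text files. The pattern below is a simple assumption and will miss plenty and flag false positives; the sample config is made up.

```python
import re

# Deliberately simple pattern: a credential-like key followed by = or : and a value.
SECRET_PATTERN = re.compile(
    r"(password|passwd|secret|api[_-]?key|token)\s*[:=]\s*\S+",
    re.IGNORECASE,
)


def find_secret_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like they hold a credential."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if SECRET_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical configuration file content used only for illustration.
sample = """\
DB_HOST=localhost
DB_PASSWORD=hunter2
# api_key: abc123
DEBUG=true
"""

for number, line in find_secret_lines(sample):
    print(number, line)
```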
Task 11: OSINT - S3 Buckets
S3 buckets are storage containers in Amazon S3, a service from Amazon Web Services (AWS) that lets people store files, and even static website content, in the cloud and serve them over HTTP and HTTPS. The owner can set access permissions that make files public, private, or even writable. These permissions are sometimes set incorrectly and inadvertently expose files that shouldn't be available to the public. Bucket URLs take the form http(s)://{name}.s3.amazonaws.com, where {name} is chosen by the owner, for example tryhackme-assets.s3.amazonaws.com. S3 buckets can be discovered in many ways: by finding the URLs in the website's page source, in GitHub repositories, or by automating the process. One common automation method is to append common terms to the company name, such as {name}-assets, {name}-www, {name}-public, and {name}-private.
S3 Bucket Discovery:
- URL Format: http(s)://{name}.s3.amazonaws.com
- Common Naming Patterns: {name}-assets, {name}-www, {name}-public, {name}-private
- Discovery Methods: Page source analysis, GitHub repositories, automation
- Access Permissions: Public, private, or writable configurations
- Exposed Content: Configuration files, backup data, application source code
TryHackMe Resources:
- S3 buckets can be discovered through various methods
- Misconfigured permissions can expose sensitive data
- Common naming patterns help in discovery
- Always check for exposed configuration files and credentials
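The naming-pattern approach can be sketched as a small generator of candidate bucket URLs. The suffix list is the one given above; the function name is made up for this example.

```python
# Common suffixes taken from the naming patterns listed above.
SUFFIXES = ("assets", "www", "public", "private")


def candidate_bucket_urls(name: str, suffixes=SUFFIXES, scheme: str = "https") -> list[str]:
    """Return candidate S3 bucket URLs of the form {name}-{suffix}.s3.amazonaws.com."""
    return [f"{scheme}://{name}-{suffix}.s3.amazonaws.com" for suffix in suffixes]


for url in candidate_bucket_urls("tryhackme"):
    print(url)
```

Each candidate would then be requested to see whether the bucket exists and what its permissions allow.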
Task 12: Automated Discovery
Automated discovery uses tools to find content instead of checking for it by hand. The process involves sending hundreds, thousands, or even millions of requests to a web server to check whether particular files or directories exist.
What are Wordlists?
Wordlists are text files containing commonly used words for different use cases. For content discovery, we need lists containing common directory and file names. An excellent resource is https://github.com/danielmiessler/SecLists.
Automation Tools:
We'll cover three preinstalled tools: ffuf, dirb and gobuster.
Using ffuf:
ffuf -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt -u http://MACHINE_IP/FUZZ
Using dirb:
dirb http://MACHINE_IP/ /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
Using Gobuster:
gobuster dir --url http://MACHINE_IP/ -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
TryHackMe Resources:
- SecLists wordlist repository: https://github.com/danielmiessler/SecLists
- ffuf: Fast web fuzzer for content discovery
- dirb: Directory brute-forcing tool
- gobuster: Multi-purpose tool for directory and DNS enumeration
- Wordlists contain common directory and file names for brute-forcing
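To show what these tools do under the hood, here is a simplified Python sketch of the core loop: append each wordlist entry to the base URL, request it, and keep anything that isn't a 404. It is not how ffuf, dirb, or gobuster are actually implemented. The get_status callable is a stand-in for a real HTTP client, and the fake site data is invented so the example runs offline.

```python
from typing import Callable, Iterable


def load_words(wordlist_text: str) -> list[str]:
    """Return the non-empty entries of a wordlist, one per line."""
    return [line.strip() for line in wordlist_text.splitlines() if line.strip()]


def discover(
    base_url: str,
    words: Iterable[str],
    get_status: Callable[[str], int],
    hide: tuple = (404,),
) -> list[tuple[str, int]]:
    """Request base_url/word for each word and return (url, status) for hits."""
    base = base_url.rstrip("/")
    hits = []
    for word in words:
        url = f"{base}/{word}"
        status = get_status(url)
        if status not in hide:
            hits.append((url, status))
    return hits


# Invented responses standing in for a real web server.
FAKE_SITE = {"http://MACHINE_IP/admin": 200, "http://MACHINE_IP/private": 301}


def fake_status(url: str) -> int:
    return FAKE_SITE.get(url, 404)


words = load_words("admin\nimages\n\nprivate\n")
for url, status in discover("http://MACHINE_IP/", words, fake_status):
    print(status, url)
```

Passing the status lookup in as a parameter keeps the loop testable without touching the network; a real run would swap in a function that performs the HTTP request.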
PT1 Exam Relevance
This room covers essential content discovery skills for PT1 certification:
- Reconnaissance Techniques: Fundamental for web application assessment
- Information Gathering: Critical for comprehensive security testing
- OSINT Skills: Essential for modern penetration testing
- Manual Discovery: Complements automated tools
- Technology Identification: Helps understand attack surface
Essential Content Discovery Checklist
CRITICAL: Always check these key areas during content discovery:
- robots.txt: Check for restricted areas and hidden directories
- sitemap.xml: Find indexed pages and old webpages
- HTTP Headers: Use curl http://target.com -v to reveal server info
- Favicon: Download and hash to identify frameworks
- Page Source: Look for comments, framework clues, and hardcoded paths
- Wappalyzer: Identify technologies and versions
- Wayback Machine: Check for historical content and old endpoints
- GitHub: Search for exposed source code and credentials
- S3 Buckets: Check for misconfigured cloud storage
- Google Dorking: Use advanced search operators for content discovery
Key Takeaways
- Content discovery is essential for comprehensive web application testing
- Manual techniques complement automated tools
- OSINT provides valuable intelligence before active testing
- Understanding technology stack helps identify vulnerabilities
- Historical data can reveal forgotten vulnerabilities
- Always check for misconfigurations in cloud services