Content Discovery

  • Date: 06-02-2025
  • Platform: TryHackMe
  • Difficulty: Easy
  • PT1 Exam Preparation

Overview

This room covers techniques for discovering hidden content, directories, and files on web applications. These are essential skills for web application security assessment and penetration testing.

Task 1: What Is Content Discovery?

Content discovery finds hidden or unlinked content on web applications, including directories, files, and endpoints not immediately visible to users.

Content types: files, videos, pictures, backup files, website features, administration portals, staff-only areas.

Three main discovery methods: manual, OSINT (Open-Source Intelligence), and automated.

Task 2: Manual Discovery - Robots.txt

The robots.txt file tells search engines which pages to crawl or avoid. It often reveals restricted areas like administration portals or customer files that owners don't want discovered.
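As a minimal sketch of pulling the restricted paths out of a robots.txt file, the helper below reads the file on stdin and prints only the Disallow values. The function name robots_disallowed is hypothetical, and the sketch assumes a grep and sed that understand POSIX character classes (true of GNU tools).

```shell
# Print the path from every Disallow line of a robots.txt read on stdin.
# robots_disallowed is a hypothetical helper name, not part of any tool.
robots_disallowed() {
  # Keep Disallow lines (case-insensitive), strip the field name,
  # trim trailing whitespace and carriage returns, drop empty values.
  grep -i '^[[:space:]]*disallow:' \
    | sed -e 's/^[^:]*:[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Usage against a live target (MACHINE_IP is the room's placeholder):
#   curl -s http://MACHINE_IP/robots.txt | robots_disallowed
```

Each printed path is a candidate to visit by hand; an empty Disallow value (which means "allow everything") is skipped.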

Task 3: Manual Discovery - Favicon

The favicon can reveal framework information when frameworks leave default icons. OWASP hosts a database of common framework icons at https://wiki.owasp.org/index.php/OWASP_favicon_database.

Practical Example - Favicon Analysis:

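1. Download and Hash: Fetch the site's favicon (for example with curl) and compute its MD5 hash.

2. Framework Identification: Look up the hash in the OWASP favicon database to see which framework ships that icon.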
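The two steps can be sketched as a small pipeline. This assumes GNU coreutils md5sum is installed; the function name favicon_md5 and the /favicon.ico path are illustrative assumptions, since a site may serve its icon from a different location.

```shell
# Print the MD5 hex digest of whatever is read on stdin.
# favicon_md5 is a hypothetical helper name.
favicon_md5() {
  # md5sum prints "<digest>  -"; keep only the digest.
  md5sum | cut -d ' ' -f 1
}

# Usage (MACHINE_IP is the room's placeholder; the icon path is assumed):
#   curl -s http://MACHINE_IP/favicon.ico | favicon_md5
```

The resulting digest is what you search for in the OWASP favicon database. A match suggests, but does not prove, that the framework's default icon is still in place.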

Task 4: Manual Discovery - Sitemap.xml

The sitemap.xml file lists every file the website owner wants indexed by search engines. It can reveal areas that are difficult to navigate or old webpages still working behind the scenes.
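A rough sketch for listing the URLs in a sitemap: the helper below extracts the contents of each <loc> element from XML read on stdin. The name sitemap_urls is hypothetical, and the sketch assumes each <loc>…</loc> pair sits on a single line and that grep supports -o (GNU and BSD grep do). It is a text filter, not a real XML parser.

```shell
# Print the URL inside every <loc>...</loc> element read on stdin.
# sitemap_urls is a hypothetical helper name.
sitemap_urls() {
  grep -o '<loc>[^<]*</loc>' \
    | sed -e 's/<loc>[[:space:]]*//' -e 's/[[:space:]]*<\/loc>//'
}

# Usage (MACHINE_IP is the room's placeholder):
#   curl -s http://MACHINE_IP/sitemap.xml | sitemap_urls
```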

Task 5: Manual Discovery - HTTP Headers

HTTP headers can reveal server software, programming languages, and custom information useful for finding vulnerabilities.

Practical Example - Header Analysis:

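1. View Headers: Run curl http://MACHINE_IP -v to print the request and response headers along with the page body.

2. Analyze Information: Note headers such as Server (web server and version) and X-Powered-By (language and version), plus any non-standard custom headers.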

TryHackMe Example:

  • Web server: NGINX 1.18.0
  • PHP version: 7.4.3
  • Custom header: X-FLAG with THM{HEADER_FLAG}
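To cut the header dump down to the lines that usually matter, a filter along these lines may help. It is only a sketch: the function name interesting_headers is hypothetical, and the choice to keep Server plus every X- prefixed header is an assumption about what counts as interesting.

```shell
# Keep the Server header and any X-* header from raw response headers on stdin.
# interesting_headers is a hypothetical helper name.
interesting_headers() {
  # Strip carriage returns, then match header names case-insensitively.
  tr -d '\r' | grep -i -E '^(server|x-[a-z0-9-]+):'
}

# Usage: -D - writes the response headers to stdout, -o /dev/null drops the body.
#   curl -s -D - -o /dev/null http://MACHINE_IP | interesting_headers
```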

Task 6: Manual Discovery - Framework Stack

Once you've established the framework of a website, either from the favicon example or by looking for clues in the page source such as comments, copyright notices, or credits, you can locate the framework's website. From there, you can learn more about the software, which may lead to more content to discover.

Practical Example - Framework Analysis:

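1. Identify Framework: Check the favicon hash and the page source (comments, copyright notices, credits) for the framework's name.

2. Research Framework: Visit the framework's website and read its documentation for default paths, such as an administration portal, and for default credentials.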

TryHackMe Example:

  • Framework found in page source comments
  • Framework website: https://static-labs.tryhackme.cloud/sites/thm-web-framework
  • Administration portal path discovered
  • Flag found: THM{CHANGE_DEFAULT_CREDENTIALS}
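Since framework clues often sit in HTML comments, a quick way to surface them is to print every comment in the page source. The helper below is a minimal sketch with a hypothetical name (html_comments). It only catches comments that start and end on the same line and contain no ">" character, so treat its output as a starting point rather than a full listing.

```shell
# Print single-line HTML comments found in page source read on stdin.
# html_comments is a hypothetical helper name.
html_comments() {
  grep -o '<!--[^>]*-->'
}

# Usage (MACHINE_IP is the room's placeholder):
#   curl -s http://MACHINE_IP/ | html_comments
```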

Task 7: OSINT - Google Hacking / Dorking

Google hacking uses advanced search operators to filter search results for specific content. For example, the query site:tryhackme.com admin returns only results from tryhackme.com that contain the word "admin".

Common Google Search Operators:

Filter     Example              Description
site       site:tryhackme.com   Returns results only from the specified website address
inurl      inurl:admin          Returns results that have the specified word in the URL
filetype   filetype:pdf         Returns results which are a particular file extension
intitle    intitle:admin        Returns results that contain the specified word in the title

TryHackMe Resources:

  • More information about Google hacking: https://en.wikipedia.org/wiki/Google_hacking
  • Advanced search operators for discovering sensitive information
  • Combining multiple filters for precise content discovery
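Combining filters is just a matter of joining operators with spaces. As a small illustration, the hypothetical helper build_dork below pins a query to one site and appends whatever extra operators or keywords are passed in; the resulting string is pasted into the search box.

```shell
# Build a search query restricted to one site.
# build_dork is a hypothetical helper: $1 = domain, remaining args = extra terms.
build_dork() {
  domain=$1
  shift
  printf 'site:%s' "$domain"
  for term in "$@"; do
    printf ' %s' "$term"
  done
  printf '\n'
}

# Usage:
#   build_dork tryhackme.com inurl:admin filetype:pdf
```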

Task 8: OSINT - Wappalyzer

Wappalyzer (https://www.wappalyzer.com/) is an online tool and browser extension that identifies the technologies a website uses, such as frameworks, Content Management Systems (CMS), and payment processors. It can often detect version numbers as well.

Wappalyzer Features:

TryHackMe Resources:

  • Wappalyzer website: https://www.wappalyzer.com/
  • Browser extension available for Chrome, Firefox, and other browsers
  • Online tool for quick technology stack analysis

Task 9: OSINT - Wayback Machine

The Wayback Machine (https://archive.org/web/) is a historical archive of websites that dates back to the late 90s. You can search a domain name, and it will show you all the times the service scraped the web page and saved the contents. This service can help uncover old pages that may still be active on the current website.

Wayback Machine Usage:

TryHackMe Resources:

  • Wayback Machine website: https://archive.org/web/
  • Search by domain name to view historical snapshots
  • Can reveal forgotten or deprecated functionality
  • Useful for finding old backup files and endpoints
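Besides the web interface, the Internet Archive exposes a CDX query endpoint that can list archived URLs for a domain, which is handy for scripting. The sketch below only builds the query URL; the function name wayback_cdx_url is hypothetical, and the parameter choices (fl=original to print just the URL, collapse=urlkey to merge repeated captures) are one reasonable combination rather than the only one.

```shell
# Build a Wayback Machine CDX query that lists archived URLs under a domain.
# wayback_cdx_url is a hypothetical helper: $1 = domain name.
wayback_cdx_url() {
  printf 'https://web.archive.org/cdx/search/cdx?url=%s/*&fl=original&collapse=urlkey\n' "$1"
}

# Usage: fetch the list and remove duplicates.
#   curl -s "$(wayback_cdx_url tryhackme.com)" | sort -u
```

Any old path in the output is worth requesting against the live site to see whether it still responds.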

Task 10: OSINT - GitHub

GitHub is a hosted version of Git on the internet. Repositories can be public or private with various access controls. Use GitHub's search feature to look for company names or website names to locate repositories belonging to your target.

GitHub for OSINT:

TryHackMe Resources:

  • GitHub search functionality for finding target-related repositories
  • Public repositories may contain sensitive information
  • Search for company names, domain names, or project names
  • Look for exposed credentials in configuration files
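Once a target repository has been cloned locally, a recursive grep gives a quick first pass for credential-like assignments. This is a rough sketch: the function name find_secrets and the keyword list are assumptions, the pattern will produce false positives, and --exclude-dir assumes GNU or BSD grep.

```shell
# List lines in a cloned repository that look like credential assignments.
# find_secrets is a hypothetical helper: $1 = path to the local clone.
find_secrets() {
  # -r recurse, -n show line numbers, -i ignore case, skip Git's own metadata.
  grep -r -n -i -E --exclude-dir=.git \
    '(password|passwd|secret|api[_-]?key|token)[[:space:]]*[:=]' "$1"
}

# Usage:
#   git clone <repository-url> target-repo
#   find_secrets target-repo
```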

Task 11: OSINT - S3 Buckets

S3 buckets are a storage service provided by Amazon Web Services (AWS) that lets people save files, and even static website content, in the cloud, accessible over HTTP and HTTPS. The owner can set access permissions to make files public, private, or even writable. Sometimes these permissions are set incorrectly and inadvertently expose files that shouldn't be available to the public.

Bucket URLs take the form http(s)://{name}.s3.amazonaws.com, where {name} is chosen by the owner, for example tryhackme-assets.s3.amazonaws.com.

S3 buckets can be discovered in many ways, such as finding the URLs in the website's page source or in GitHub repositories, or by automating the process. One common automation method is to append common terms to the company name: {name}-assets, {name}-www, {name}-public, {name}-private, and so on.

S3 Bucket Discovery:

TryHackMe Resources:

  • S3 buckets can be discovered through various methods
  • Misconfigured permissions can expose sensitive data
  • Common naming patterns help in discovery
  • Always check for exposed configuration files and credentials
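The naming-pattern approach can be sketched as a generator of candidate bucket URLs. The function name s3_candidates is hypothetical, and the suffix list is just the four terms mentioned above; a real run would likely use a longer list.

```shell
# Print candidate S3 bucket URLs for a company name using common suffixes.
# s3_candidates is a hypothetical helper: $1 = company name.
s3_candidates() {
  for suffix in assets www public private; do
    printf 'https://%s-%s.s3.amazonaws.com\n' "$1" "$suffix"
  done
}

# Usage: print the HTTP status code for each candidate.
#   s3_candidates tryhackme | while IFS= read -r url; do
#     printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "$url")" "$url"
#   done
```

As a rule of thumb, a 404 usually means the bucket does not exist, a 403 means it exists but denies anonymous access, and a 200 means its contents can be listed publicly.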

Task 12: Automated Discovery

Automated discovery uses tools to discover content rather than doing it manually. The process involves sending hundreds, thousands, or even millions of requests to a web server to check whether files or directories exist.

What are Wordlists?

Wordlists are text files containing commonly used words for different use cases. For content discovery, we need lists containing common directory and file names. An excellent resource is https://github.com/danielmiessler/SecLists.

Automation Tools:

We'll cover three preinstalled tools: ffuf, dirb and gobuster.

Using ffuf:

ffuf -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt -u http://MACHINE_IP/FUZZ

Using dirb:

dirb http://MACHINE_IP/ /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt

Using Gobuster:

gobuster dir --url http://MACHINE_IP/ -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
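To make clear what these tools do under the hood, here is a deliberately simple sketch of the same idea in plain shell: build one URL per wordlist entry, request it, and report anything that is not a 404. The function names candidate_urls and discover are hypothetical, and the sketch is far slower than ffuf, dirb, or gobuster because it sends requests one at a time.

```shell
# Print one candidate URL per wordlist entry (blank lines and # comments skipped).
# candidate_urls is a hypothetical helper: $1 = base URL, $2 = wordlist file.
candidate_urls() {
  base=${1%/}   # drop a single trailing slash so URLs are joined cleanly
  while IFS= read -r word || [ -n "$word" ]; do
    case "$word" in
      ''|'#'*) continue ;;
    esac
    printf '%s/%s\n' "$base" "$word"
  done < "$2"
}

# Request each candidate and print the status code for everything except 404.
# discover is a hypothetical helper with the same arguments as candidate_urls.
discover() {
  candidate_urls "$1" "$2" | while IFS= read -r url; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" != "404" ]; then
      printf '%s %s\n' "$code" "$url"
    fi
  done
}

# Usage:
#   discover http://MACHINE_IP/ /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
```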

TryHackMe Resources:

  • SecLists wordlist repository: https://github.com/danielmiessler/SecLists
  • ffuf: Fast web fuzzer for content discovery
  • dirb: Directory brute-forcing tool
  • gobuster: Multi-purpose tool for directory and DNS enumeration
  • Wordlists contain common directory and file names for brute-forcing

PT1 Exam Relevance

This room covers essential content discovery skills for PT1 certification:

Essential Content Discovery Checklist

CRITICAL: Always check these key areas during content discovery:

  • robots.txt: Check for restricted areas and hidden directories
  • sitemap.xml: Find indexed pages and old webpages
  • HTTP Headers: Use curl http://target.com -v to reveal server info
  • Favicon: Download and hash to identify frameworks
  • Page Source: Look for comments, framework clues, and hardcoded paths
  • Wappalyzer: Identify technologies and versions
  • Wayback Machine: Check for historical content and old endpoints
  • GitHub: Search for exposed source code and credentials
  • S3 Buckets: Check for misconfigured cloud storage
  • Google Dorking: Use advanced search operators for content discovery

Key Takeaways