🗺️ Sitemap Generator
Method 1: SearXNG-Powered Discovery (Recommended)
Best for: Modern websites with JavaScript frameworks, SPAs, and sites where traditional crawling fails.
✨ Pro Tip: You can now search specific paths! Use "example.com/blog/" to find only URLs under that directory, or just "example.com" for the whole domain.
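The path scoping described in the tip can be sketched as a simple filter over discovered URLs. This is a minimal illustration, not the tool's actual code: the function names parse_target and in_scope are hypothetical, and it assumes an exact host match plus a plain path-prefix comparison.

```python
from urllib.parse import urlsplit


def parse_target(target: str) -> tuple[str, str]:
    """Split a target such as "example.com/blog/" into (host, path_prefix)."""
    # urlsplit only recognises the host when a scheme or "//" is present.
    if "://" not in target:
        target = "//" + target
    parts = urlsplit(target)
    return parts.hostname or "", parts.path or "/"


def in_scope(url: str, target: str) -> bool:
    """Return True if url is on the target host and under its path prefix."""
    host, prefix = parse_target(target)
    parts = urlsplit(url)
    # Assumption: hosts must match exactly (no "www." or subdomain folding).
    if (parts.hostname or "") != host:
        return False
    return (parts.path or "/").startswith(prefix)


found = [
    "https://example.com/blog/post-1",
    "https://example.com/about",
    "https://other.org/blog/post-2",
]
blog_only = [u for u in found if in_scope(u, "example.com/blog/")]
```

Keeping the trailing slash in "example.com/blog/" matters with a prefix check like this one: without it, "/blog" would also match "/blogger".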
Method 2: Traditional Web Crawling
Crawl a website by following the links on each page.
Note: This method only works well with traditional HTML sites. Pages rendered client-side by modern JavaScript frameworks may not be fully discovered.
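Link-following can be sketched with the Python standard library alone. The version below is a rough illustration under stated assumptions, not the generator's implementation: the names LinkParser, fetch_html, and crawl, and the max_pages and delay parameters, are hypothetical. It also shows why JavaScript-rendered sites are missed: only links present in the raw HTML are ever seen.

```python
import time
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url: str) -> str:
    """Download a page as text; a real tool would also check the content type."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url: str, max_pages: int = 20, delay: float = 1.0,
          fetch: Callable[[str], str] = fetch_html) -> list[str]:
    """Breadth-first crawl that stays on the start URL's host."""
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited: list[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to load
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)  # pause between requests to be polite
    return visited
```

Passing fetch as a parameter is a design choice for this sketch: it lets the crawl logic be exercised against canned HTML without touching the network.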
Method 3: Parse Existing Sitemap
Extract URLs from an existing sitemap.xml file.
Tip: Many sites publish their sitemap at /sitemap.xml or /sitemap_index.xml.
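Extracting URLs from a sitemap might look roughly like the sketch below. It assumes the file uses the standard sitemaps.org namespace; the function name parse_sitemap is hypothetical. A sitemap index lists further sitemap files rather than pages, so the sketch also reports which kind of document it read.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return (kind, locations); kind is "urlset" or "sitemapindex"."""
    root = ET.fromstring(xml_text)
    # ElementTree writes tags as "{namespace}name"; keep only the name.
    kind = root.tag.rsplit("}", 1)[-1]
    locations = []
    for entry in root:  # each <url> or <sitemap> element
        loc = entry.find("sm:loc", NS)
        if loc is not None and loc.text:
            locations.append(loc.text.strip())
    return kind, locations


example = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/blog/</loc></url>"
    "</urlset>"
)
kind, urls = parse_sitemap(example)
```

When kind is "sitemapindex", each returned location is itself a sitemap file that would need to be fetched and parsed the same way.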
Usage Tips:
- For ArchiveBox: Save the generated file as sitemap.txt, then run: archivebox add < sitemap.txt
- Try SearXNG first: The SearXNG method usually finds more pages on JavaScript-heavy sites than link-following does
- Respect robots.txt: Check the site's robots.txt before crawling
- Be patient: All methods include delays to be respectful to servers
- Start small: Test with a low URL limit first, then increase it if needed
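The robots.txt check from the tips above can be done with Python's standard urllib.robotparser module. The sketch below is an illustration only: the helper name allowed_urls is hypothetical, and it assumes the robots.txt content has already been downloaded as text.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, urls: list[str],
                 user_agent: str = "*") -> list[str]:
    """Keep only the URLs that robots.txt permits for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "https://example.com/",
    "https://example.com/private/notes",
    "https://example.com/blog/",
]
permitted = allowed_urls(robots, candidates)
```

Running the filter before crawling, rather than after, avoids requesting disallowed pages in the first place.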
Which Method Should I Use?
- SearXNG Method: Best for any modern website, especially if you've had issues with traditional crawling
- Traditional Crawling: Good for simple HTML sites and when you want to follow the exact link structure
- Sitemap Parsing: Perfect when a site already has a comprehensive sitemap.xml