Using Search Engines to Uncover Sensitive Data on the Web

Search engines are powerful tools that can help you well beyond finding useful resources and interesting articles. If you are unsure which word fits the phrase you are writing, or how to spell it, a search engine can give you the answer. You can also search for specific terms within a web page and check whether they appear there. This is possible through the many search operators that engines like Google provide. For instance, entering the following string in the Google search engine returns indexed pages of the ACM Digital Library that contain the word “injection”:

"injection" site:dl.acm.org

Nevertheless, a search engine can also be used for malicious purposes, such as finding vulnerable systems, confidential files and exposed FTP servers. In fact, the hacker community actively uses such engines as tools of the trade. How is this possible? As Mitnick stated in his book “The Art of Deception”: “the human factor is truly security’s weakest link”. Sometimes users or developers accidentally upload sensitive data to easily accessible online storage or services (e.g. FTP servers, web-based hosting services and others), and search engines then index it. Specifically, Google scans new domain names and infers from the name that an HTTP or FTP server may be responding there and is therefore worth indexing. Bing goes further and performs URL/port probing: the engine requests files that may not even exist, a practice it has excused as ‘beta testing’ (administrators have publicly complained about this behavior).

Given that, you can search for Excel files (.xls) on indexed FTP servers that contain the word “password”, by placing the following string in the Google search engine:

inurl:ftp "password" filetype:xls
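If you experiment with many such queries, it helps to build them systematically. Below is a minimal sketch of a hypothetical helper (the function name and keyword parameters are my own) that composes the Google operators used in this article — inurl:, intext: and filetype: — into a single query string; the operators themselves are order-insensitive as far as Google is concerned:

```python
# Hypothetical helper: compose Google search operators into one query
# string. Only the operators discussed in this article are supported.
def build_dork(*, inurl=None, filetype=None, intext=None, terms=()):
    parts = list(terms)                     # plain search terms first
    if inurl:
        parts.append(f"inurl:{inurl}")      # match text in the URL
    if intext:
        parts.append(f"intext:{intext}")    # match text in the page body
    if filetype:
        parts.append(f"filetype:{filetype}")  # restrict by file extension
    return " ".join(parts)

print(build_dork(inurl="ftp", filetype="xls", terms=('"password"',)))
# → "password" inurl:ftp filetype:xls
```

The output is equivalent to the query above, just with the plain term placed first.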

You will be surprised by the results. Thousands of Excel files turn up that contain usernames and passwords used to access forums, web services, routers, wireless networks and more. Now consider the following search string:

mysql_connect OR mysql_pconnect filetype:inc OR filetype:bak OR filetype:old

This string searches for common file extensions (.inc, .bak, .old) that PHP developers use for backup copies. Such extensions are normally not interpreted as code by the server, so the files are served as plain text and any database connection credentials inside them can be read. This query returns more than 70,000 results.
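On the defensive side, you can catch such leftovers before a crawler does. The sketch below (the function name and the example directory are my own; the extension list mirrors the query above) walks a local web root and flags files that a web server would serve as plain text instead of executing:

```python
import os

# Extensions the query above targets: backup copies of PHP includes
# that the server will not interpret as code.
RISKY_EXTENSIONS = (".inc", ".bak", ".old")

def find_risky_files(root):
    """Return paths under `root` whose extension suggests a leaked backup."""
    hits = []
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(RISKY_EXTENSIONS):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Running this periodically over your document root (e.g. `find_risky_files("/var/www/html")`) is a cheap way to spot exactly the files these dorks hunt for.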

GitHub is a hosting service for software projects that use the Git revision control system. By passing the following string to the Google search engine you can find disclosed FTP login credentials within GitHub repositories:

inurl:sftp-config.json

You can make your search string more specific and find similar credentials for WordPress installations, in the following manner:

inurl:sftp-config.json intext:/wp-content/
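A simple countermeasure is to scan your own working copy before publishing it. The sketch below (function name mine) looks for sftp-config.json files — the Sublime Text SFTP plugin configuration that these dorks target — that still hold a plaintext "password" entry. Note that the real plugin tolerates // comments in the file, which strict JSON parsing rejects, so this sketch assumes plain JSON:

```python
import json
import os

def leaked_sftp_configs(repo_root):
    """Return paths of sftp-config.json files containing a plaintext password."""
    leaks = []
    for dirpath, _subdirs, files in os.walk(repo_root):
        if "sftp-config.json" in files:
            path = os.path.join(dirpath, "sftp-config.json")
            try:
                with open(path) as fh:
                    config = json.load(fh)
            except (OSError, json.JSONDecodeError):
                continue  # unreadable or non-strict JSON: skip rather than crash
            if config.get("password"):
                leaks.append(path)
    return leaks
```

A check like this belongs in a pre-commit or pre-push hook; even better, add sftp-config.json to your .gitignore so it never enters the repository at all.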

Obviously, you must be very careful about what you upload and where. If you maintain a web page, you also have to set up your robots.txt file with caution. In essence, the Robots Exclusion Standard, or robots.txt protocol, is a convention that tells web crawlers and other web robots which files or directories of a site they may index and which they should avoid. You can read more about Google hacking in the book of the same name. You can also refer to this interesting article. In addition, there is a database that contains hundreds of such attack patterns.
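Python's standard library can parse robots.txt, which makes it easy to check what your own file actually tells crawlers. The example below uses a minimal robots.txt of my own invention; keep in mind that the standard is advisory — well-behaved crawlers honor it, but it is not access control, and listing a sensitive directory there also advertises its location to anyone who fetches the file:

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt (example content, not from the article):
robots_txt = """\
User-agent: *
Disallow: /backups/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())   # parse() accepts an iterable of lines

print(parser.can_fetch("*", "/index.html"))      # True: not disallowed
print(parser.can_fetch("*", "/backups/db.xls"))  # False: compliant crawlers skip it
```

This is also a quick way to verify a deployed file: `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` fetches and parses the live copy.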

I must mention here that actively exploiting such vulnerabilities without the permission of those affected is unethical and illegal. Passively looking for vulnerabilities, even in a controlled environment, is probably legal in some countries, but in other jurisdictions you can end up in jail even for that. If you want to read more about ethical hacking, you can refer to this book.
