I recently needed to search through a collection of 20,000 text files. After trying basic tools like find
and grep
, I decided to give sqlite3 a try.
Data Preparation
Using a Jupyter Notebook, I cleaned the documents and imported them into a sqlite3 database. I used Beautiful Soup to extract text from both HTML and plain text documents. The entire cleaning process took around 5 hours. Once I had everything figured out, the actual database import step took about 30 seconds.
Full Text Search with FTS5
I created an FTS5 virtual table to enable fast full-text search with boolean logic and fuzzy matching. This allowed me to quickly find relevant files.
I wrote a second Jupyter Notebook to provide a simple notebook cell based UI for querying and retrieving documents. That worked, but I wanted to make things even more convenient.
Web-Based Search UI with Flask
I built a simple web-based search UI using Flask. To create and debug the Flask server, I used Visual Studio Code with their excellent Flask tutorial.
Lessons Learned
- Jupyter Notebooks are a useful tool for data cleaning and import.
- sqlite and FTS5 works well for simple document search applications.
- Flask is an easy-to-use library for simple web servers.
- Visual Studio Code is great for creating and debugging Flask servers.
Room for Improvement
- The UI could be improved by using a design system like Bootstrap.
Overall, I was able to create a functional search tool that can handle a my collection of text files. It took about 10 hours of work, and I learned a few things along the way.
Two thumbs up, would hack again!