Searching in large files with grep

Mar 19, 2020
While writing the Subflix service, I thought about how to search within the subtitles I had. I considered parsing the time and content values from the subtitle files and transferring them to a database, but I gave up on this because I had too many subtitle files. Additionally, there were already some database operations running on the server where the service was operating, and adding this would strain the server, so I opted for a different solution: grep.
In Linux, we generally use grep by piping another command's output into it: for example, cat file.txt | grep 'value' or ls /usr/lib | grep 'value'. However, grep can also search within the contents of files directly, and once regular expressions come into play it becomes quite convenient.
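For instance (the file name and patterns here are only illustrative):

    grep 'value' file.txt             # search inside a file directly, no pipe needed
    grep -E 'valu(e|es)' file.txt     # -E turns on extended regular expressions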

A few parameters

One of the first commands I tried was as follows.
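Something along these lines, where ./subs stands in for the subtitle directory and "hello" for the search term:

    grep -r "hello" ./subs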
Here, the -r parameter makes grep recurse through the files in all subdirectories under the given directory. Without it, grep does not descend into directories and only searches the files it is given directly.
Adding the -i parameter makes the search case-insensitive, so it doesn't matter how the user capitalizes the term.
At this point I was using the @ladinu/node-grep package, but at some point I decided there was no need for it and switched to Node.js's built-in child_process.exec function instead.
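With both parameters, the sketch becomes:

    grep -ri "hello" ./subs
    # matches hello, Hello, HELLO, ...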

Creating a limit

Finally, what I needed was a limit on the output. When searching within subtitles, some values can produce thousands of matches. Grep's -m (--max-count) parameter can set a limit, but combined with -r it only limits the matches per file, not the total, so it didn't solve my problem. I tried to solve it a bit differently by combining grep with the find command.
The find command searches for files by name rather than searching within their contents. The reason for combining it with grep is that find gives us a way to stop the search early, which is effectively our limit. While doing this, though, I ended up chaining a few other Linux tools: expect, timeout, and head.
The idea is to run the grep "hello" search over all the files under the subs directory and capture it with the expect command. expect is normally used to script programs that expect interactive user input; here it lets us pass grep's output on to the next stage of the command as it is produced.
Of course, the command isn't just this; so far we are only searching. Now we need to create the limit, and that is where head comes in.
The head command reads the head (the first lines) of a file or stream; by giving it a number with -n, we decide how many lines to read.
What happens here is this: find keeps producing output and sending new lines down the pipe until head hits its limit. Once that limit is reached, head stops, but find itself carries on running through the remaining files. Since that is something we don't want, we add find's -quit parameter.
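Sketched with unbuffer, which ships with the expect package, standing in for a full expect script (subs and "hello" are still placeholders):

    unbuffer find ./subs -type f -exec grep -H "hello" {} \;
    # -H prefixes each match with its file name;
    # unbuffer keeps the output line-buffered so the next stage sees matches as they appear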
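Bolted onto the sketch above, it looks like this:

    unbuffer find ./subs -type f -exec grep -H "hello" {} \; | head -n 10
    # head exits after printing 10 matching lines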
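One way to wire it in, shown here purely as a sketch, is to put -quit right after the -exec, so find exits as soon as one grep invocation finds a match instead of carrying on through the rest of the files:

    unbuffer find ./subs -type f -exec grep -H "hello" {} \; -quit | head -n 10
    # -exec succeeds when grep finds a match in a file; -quit then stops find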
Thus, with this command, we can perform an effective search in large text files.

What about timeout?

As the number of files increases and their sizes grow, very specific searches, or searches for a value that doesn't exist, become unavoidable with the method above: an answer (even an empty one) only comes back after every file has been searched. In other words, response time grows in direct proportion to the number and size of the files.
To prevent this, the best method is to set a timeout. We can do this with the timeout command.
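Wrapping the earlier sketch in timeout looks like this (the pipeline needs an sh -c so that timeout covers all of it):

    timeout 5 sh -c 'unbuffer find ./subs -type f -exec grep -H "hello" {} \; | head -n 10'
    # the whole search gets at most 5 seconds before timeout kills it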
If it can't find at least 10 matches for the given term within 5 seconds, the command is terminated. This can be used as a practical solution.

A more logical solution

Yes... I am writing this post in the evening after work while enjoying some tea. The solution I described above is working well right now: subflix.now.sh
My main goal in writing up the solution in this post was to understand what I was doing and what I was missing, and while writing I found an easier solution: merging all the subtitle files into one. Yes, instead of dealing with commands, pipes, and timeout tricks, I can give grep a single file and use the -m parameter, since I no longer need -r.
I will continue this post after making this improvement in the project.

I'm writing this in the evening of the same day, and I'm genuinely amazed that I hadn't thought of such a simple solution before. I think when we see a problem as complicated, we assume its solution will be just as complicated and go looking for difficult ways to solve it. We need to think simply.
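The merge itself is a one-liner; assuming the .vtt files may sit in subdirectories under subs, something like this does it:

    find ./subs -type f -name '*.vtt' -exec cat {} + > merged
    # concatenates every .vtt file under subs into a single file named merged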
With this command, I combine all .vtt files under the subs directory into a file named merged, and now I can perform a non-recursive search with grep!
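That search, with the term still a placeholder, now looks like this:

    grep -i -m 10 "hello" merged
    # -m 10 stops after 10 matching lines, no recursion needed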
The response speed for non-specific queries is now much better. However, for rare or non-existent words, it still scans the entire file (373MB), so the time increases as the size grows. I will look into how I can reduce this time. Even if I can't solve it, the current situation is quite satisfactory.
That's all for now.

I found another solution related to character encoding. The speed has doubled.
The forum notes that using regex in this situation is not recommended, but fortunately, I am not using regex.
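A common encoding-related trick with grep, and my assumption about what is being applied here, is forcing the C locale so grep can treat the text as plain bytes instead of decoding UTF-8:

    LC_ALL=C grep -i -m 10 "hello" merged
    # the C locale skips multibyte character handling, which is usually much faster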
 
