If you want to become a hacker, text file manipulation is a skill you will certainly need to become familiar with and efficient at. But text manipulation isn’t only for hackers. Understanding how and where to manipulate text files on a computer system opens up a whole new world. One reason that comes to mind almost instantly is automation: if the computer can do in milliseconds what would take the average human hours, we save a great deal of time by leveraging tools for these tasks.
What is text manipulation?
Text manipulation is the process of modifying text files to suit our needs.
This could mean changing the first character of every word to uppercase. Or maybe a word is misspelled throughout a whole document and we want to replace every instance with the correct spelling. These are just two examples. Imagine how you could use automation to modify text files and make any type of change you want!
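As a rough sketch of those two examples, the commands below fix a misspelled word and uppercase the first character of every word. The file name, the sample sentences and the misspelling are all made up for illustration, and sed and awk are simply one possible choice of tools.

```shell
# Scratch directory and sample file (both hypothetical, for this sketch only).
workdir=$(mktemp -d)
printf 'please recieve the parcel\nwe will recieve it today\n' > "$workdir/note.txt"

# Replace every instance of the misspelled word with the correct spelling.
fixed=$(sed 's/recieve/receive/g' "$workdir/note.txt")

# Uppercase the first character of every word: awk rebuilds each line
# one whitespace-separated field at a time.
titled=$(awk '{ for (i = 1; i <= NF; i++) $i = toupper(substr($i, 1, 1)) substr($i, 2); print }' "$workdir/note.txt")

printf '%s\n' "$fixed" "$titled"
```

Note that the awk version collapses runs of spaces between words into a single space, which is fine for a word-per-line list but may matter for formatted text.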
We all know that text files come in many sizes, from a single page of text or less to many hundreds or even thousands of pages.
Manually going through one page of text to make changes doesn’t take a great deal of effort, but there’s always a chance of human error, and both the effort and the chance of error grow with the size of the file.
By utilizing the power of the Linux command line we can modify words in a text file of any size. It’s sometimes possible to achieve our text editing goals with a single command, and in some cases Linux can do it in less than a second.
Why is Linux good at text manipulation?
Most of the command line tools that ship with a Linux operating system are designed to do one job, and one job only, but to do that one job very well.
What we can do with these command line tools is ‘connect’ them together like Lego bricks in any way we want. This is where the power of Linux can be seen.
Connecting commands together like this is known as piping. The pipe symbol is “|” (don’t confuse it with a lowercase L or any other character).
In other words, we pipe the output of one command into the next command, which uses it as its input.
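A minimal sketch of piping, using a made-up four-word list: printf produces the words, sort orders them, and uniq removes the duplicate that sort has placed next to its twin.

```shell
# Three commands connected with pipes. The output of each command
# becomes the input of the next one.
result=$(printf 'zebra\napple\nmango\napple\n' | sort | uniq)

printf '%s\n' "$result"
```

Each tool in the chain stays simple; the combination does the real work.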
Another component of Linux that we can use is the right-angle bracket “>”. By default, when we execute a command in the Linux terminal, it prints its result to the screen for you to see. But when we want to create files, we can redirect the output of a command with > followed by the name of a file. Please note: if the file exists, it WILL be overwritten! Sometimes we want this to happen, other times we don’t. We can also use a double right-angle bracket “>>” to APPEND to the file. In this case an existing file is not overwritten; the result of the command is added (appended) to the end of it. If the file doesn’t exist yet, both > and >> create it.
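The difference between the two operators can be sketched like this. The file name fruits.txt is hypothetical, and the file lives in a scratch directory so that nothing real gets overwritten.

```shell
workdir=$(mktemp -d)            # scratch directory for the sketch
out="$workdir/fruits.txt"       # hypothetical file name

echo "apple" > "$out"           # creates the file with one line
echo "banana" >> "$out"         # appends a second line
after_append=$(cat "$out")      # the file now holds apple and banana

echo "cherry" > "$out"          # overwrites: apple and banana are gone
after_overwrite=$(cat "$out")

printf '%s\n' "$after_append" "---" "$after_overwrite"
```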
Why would we use 'text manipulation' and where would we use it?
Consider the scenario: we are sat in front of a web page that’s waiting for us to enter a username and password. There’s only one problem: we’ve forgotten the password (and maybe even the username). Rather than typing in guesses by hand, we can use a wordlist file to “guess” what the password is. A cautionary note here: do not attempt this on a system that you do not own or do not have explicit consent to test, or you will most likely end up in a lot of trouble with the law.
A good place to start practicing would be your own home router, for example.
The thing with passwords is that they’re case sensitive, so we will need several ‘versions’ of the same password: some with uppercase characters, some with lowercase characters, and every possible combination of the two.
As you can imagine, it would be very labour-intensive to type out all of these possible passwords into a list file by hand. This is where the power of Linux and text manipulation comes into play.
Creating a word list
So how would we use text manipulation in this scenario? Let’s start with a simple list of words, beginning with ‘apple’ and ending with ‘zebra’. Our list will be a simple text file with each word on its own line, so ‘apple’ will be on line 1.
We can have many words that start with the same character, as it makes no difference in this example. In fact, any usable word list will certainly contain multiple words with the same first character!
Now the fun begins!
If ‘apple’ is our first word/line, we may want to create the words: Apple, APPLE, ApPlE, aPpLe. If we perform this action for every line in our word file and append the results to our original wordlist, the wordlist will now be five times its original size (the original word plus four variants per line)! We need to take file size into consideration when creating text files like this, as more often than not we end up with gigabytes of data (or more!).
Explanation of the text manipulation used in our example:
Apple: This is first char uppercase
APPLE: This is all uppercase
ApPlE: This is alternate-uppercase (alternating case, starting with uppercase)
aPpLe: This is alternate-lowercase (alternating case, starting with lowercase)
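One way the four variants could be generated is sketched below. The function name case_variants is hypothetical, and using awk for the character loop is just one option; the sketch also lowercases each input word first, which is an assumption about what we want for mixed-case input.

```shell
# Hypothetical helper: reads words on stdin (one per line) and prints each
# word in lowercase followed by the four case variants described above.
case_variants() {
    awk '{
        word = tolower($0)
        first = toupper(substr(word, 1, 1)) substr(word, 2)
        alt_upper = ""
        alt_lower = ""
        for (i = 1; i <= length(word); i++) {
            c = substr(word, i, 1)
            if (i % 2 == 1) {
                # odd positions: uppercase for ApPlE, lowercase for aPpLe
                alt_upper = alt_upper toupper(c)
                alt_lower = alt_lower c
            } else {
                # even positions: the other way round
                alt_upper = alt_upper c
                alt_lower = alt_lower toupper(c)
            }
        }
        print word
        print first
        print toupper(word)
        print alt_upper
        print alt_lower
    }'
}

printf 'apple\nzebra\n' | case_variants
```

Feeding a whole wordlist through the function and redirecting the output to a new file would give the expanded list in one step, leaving the original file untouched.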
Think about how many more possible combinations we could use. Then we could maybe add some numbers to the end.. and now imagine how big our resulting wordlist file would get!
Top 9 Linux command line tools for text manipulation
Here I list the top nine Linux command line tools that you should certainly become comfortable using for your text manipulation tasks.
sort: As the name states, this will sort the lines of a list in alphanumerical order.
cut: Cuts selected sections (fields, characters or bytes) out of each line.
uniq: Filters out duplicate lines that sit next to each other, so it is usually run on sorted input to leave only unique lines.
sed: A stream editor. We can find and replace strings (any sequence of characters that we specify).
awk: A whole other “ball game”! awk is a small pattern-matching language in its own right, so it will take some time to understand.
head: Selects the first few lines from the top of our list.
tail: Like head, only it selects the LAST few lines of our list.
cat: Prints the contents of our raw text file so that we can feed it into other commands and begin manipulating.
grep: Finds the lines of our text file that contain a specified string or pattern.
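To get a feel for the nine tools, here is a small sketch that runs each of them against the same sample file. The file name and its word:number contents are invented for illustration.

```shell
workdir=$(mktemp -d)                     # scratch directory for the sketch
list="$workdir/wordlist.txt"             # hypothetical sample file
printf 'mango:3\napple:1\nzebra:9\napple:1\nbanana:2\n' > "$list"

cat "$list"                              # print the raw file
sort "$list"                             # lines in sorted order
sort "$list" | uniq                      # sorted, with the duplicate removed
cut -d: -f1 "$list"                      # first colon-separated field of each line
grep 'an' "$list"                        # only the lines containing "an"
head -n 2 "$list"                        # first two lines
tail -n 2 "$list"                        # last two lines
sed 's/apple/cherry/g' "$list"           # replace apple with cherry
awk -F: '$2 > 1 { print $1 }' "$list"    # words whose number is above 1
```

None of these commands change the sample file; they only print their result to the screen.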
Linux command line vs graphical tools
If you think that you can simply use a graphical tool to get the same results as the command line, then think again. Although some simple text manipulation is possible with ‘point and click’ tools, you won’t get the full power of Linux unless you use the command line.
Example: Using bash to create an office document
In my other post ‘How to create an office document from a linux shell’ I take you through just how using text manipulation can automate the creation of an office document.
In this example, I use the sed command to find and replace strings in a text file.
cat wordlist_orig.txt | sed 's/change_this/to_that/g' > wordlist_modified.txt
This command feeds our original text file into sed, which then replaces every instance of the string it finds in the text. The result is placed into a new file called wordlist_modified.txt.
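To try the command without touching real files, a sketch like the following could be used. The sample lines are made up, and the files are created in a scratch directory; the last step shows what happens when the trailing g is left off.

```shell
workdir=$(mktemp -d)                          # scratch directory for the sketch
orig="$workdir/wordlist_orig.txt"
modified="$workdir/wordlist_modified.txt"

# Made-up sample content; the second line contains the target string twice.
printf 'change_this once\nchange_this and change_this\nkeep me\n' > "$orig"

# Same shape as the command above: every match on every line is replaced.
cat "$orig" | sed 's/change_this/to_that/g' > "$modified"
cat "$modified"

# Without the trailing g, only the first match on each line is replaced.
first_only=$(sed 's/change_this/to_that/' "$orig")
printf '%s\n' "$first_only"
```

The original file is left exactly as it was, because sed writes its result to the output rather than editing the input. sed can also read the file directly when it is given as an argument, as the last command does.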
What is regex and how can we use it?
If you don’t already know, bash is usually the default shell interpreter on Linux, although much of this applies to other shell interpreters too, such as the Korn shell (ksh). Bash has regex matching built in (the =~ operator inside [[ ]]), so this is one great tool we have in our arsenal of text manipulation tools.
Regex, or ‘regular expression’, is a very powerful tool and is most certainly something to practice if you want to master text manipulation.
I won’t be covering regex here as I will be giving it its very own post sometime in the future, and I will link it here when complete.
Extending bash with programming languages
If you do find yourself becoming familiar with bash and using the Linux command line very often, then the most obvious programming language to learn next would be Perl. Perl is a separate interpreter rather than part of bash, but Perl code can be run directly from the command line or from a bash script (for example as a one-liner with perl -e), and it lets us perform very complex tasks that would take many more lines if we wrote them in bash alone.
I notice that whenever I look at openings on job websites for someone who writes bash scripts, they tend to favour someone who also knows Perl, as bash and Perl work so well together.
After Perl, the next most obvious programming language for extending bash on Linux would be Python.
Python is less often written inline in bash scripts the way Perl is, and it will usually be scripted in its own file. However, in our bash scripts we can simply call the Python script file to execute at any time we like.
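As a sketch of calling a Python script file from the shell, the block below writes a tiny script to a scratch directory and then pipes text through it. The script name swapcase.py and its contents are hypothetical, and the sketch assumes python3 is installed and on the PATH.

```shell
workdir=$(mktemp -d)                 # scratch directory for the sketch
script="$workdir/swapcase.py"        # hypothetical script name

# Write a small Python script into its own file.
cat > "$script" <<'EOF'
import sys

# Print each input line with the case of every letter swapped.
for line in sys.stdin:
    print(line.rstrip("\n").swapcase())
EOF

# Call the Python script from the shell like any other command.
result=$(printf 'Apple\nzebra\n' | python3 "$script")

printf '%s\n' "$result"
```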
Because Python is a high-level programming language with a lot of support from the community, it has many libraries that let us express complex tasks in far less code than bash alone would need.
Hopefully you now have a good understanding of just what ‘text manipulation’ is, and why using Linux to perform it is a very good choice.
I would certainly recommend trying it out on the command line and creating some text files of your own.
The examples that I give in this post are only the ‘tip of the iceberg’, so to speak. We could find and replace text strings in web pages too, for example!