Science and technology

How to port an awk script to Python

Scripts are potent methods to unravel an issue repeatedly, and awk is a superb language for writing them. It excels at simple textual content processing particularly, and it could possibly convey you thru some sophisticated rewriting of config information or reformatting file names in a listing. 

When to maneuver from awk to Python

At some level, nonetheless, awk’s limitations begin to present. It has no actual idea of breaking information into modules, it lacks high quality error reporting, and it is lacking different issues that are actually thought of fundamentals of how a language works. When these wealthy options of a programming language are useful to take care of a vital script, porting turns into possibility.

My favourite trendy programming language that’s good for porting awk is Python.

Before porting an awk script to Python, it’s typically worthwhile to think about its authentic context. For instance, due to awk’s limitations, the awk code is often known as from a Bash script and consists of some calls to different command-line favorites like sed, type, and the gang. It’s finest to transform all of it into one coherent Python program. Other occasions, the script makes overly broad assumptions; for instance, the code may enable for any variety of information, despite the fact that it is run with just one in apply.

After rigorously contemplating the context and figuring out the factor to substitute with Python, it’s time to write code.

Standard awk to Python performance

The following Python performance is helpful to recollect:

with open(some_file_name) as fpin:
    for line in fpin:
        cross # do one thing with line

This code will loop by way of a file line-by-line and course of the strains.

If you wish to entry a line quantity (equal to awk’s NR), you should use the next code:

with open(some_file_name) as fpin:
    for nr, line in enumerate(fpin):
        cross # do one thing with line

awk-like conduct over a number of information in Python

If you want to have the ability to iterate by way of any variety of information whereas retaining a persistent depend of the variety of strains (like awk’s FNR), this loop can do it:

def awk_like_lines(list_of_file_names):
    def _all_lines():
        for filename in list_of_file_names:
            with open(filename) as fpin:
                yield from fpin
    yield from enumerate(_all_lines())

This syntax makes use of Python’s turbines and yield from to construct an iterator that loops by way of all strains and retains a persistent depend.

If you want the equal of each FNR and NR, here’s a extra refined loop:

def awk_like_lines(list_of_file_names):
    def _all_lines():
        for filename in list_of_file_names:
            with open(filename) as fpin:
                yield from enumerate(fpin)
    for nr, (fnr, line) in _all_lines:
        yield nr, fnr, line

More complicated awk performance with FNR, NR, and line

The query stays in the event you want all three: FNR, NR, and line. If you actually do, utilizing a three-tuple the place two of the gadgets are numbers can result in confusion. Named parameters could make this code simpler to learn, so it is higher to make use of a dataclass:

import dataclass

@dataclass.dataclass(frozen=True)
class AwkLikeLine:
    content material: str
    fnr: int
    nr: int

def awk_like_lines(list_of_file_names):
    def _all_lines():
        for filename in list_of_file_names:
            with open(filename) as fpin:
                yield from enumerate(fpin)
    for nr, (fnr, line) in _all_lines:
        yield AwkLikeLine(nr=nr, fnr=fnr, line=line)

You may marvel, why not begin with this method? The purpose to begin elsewhere is that that is nearly all the time too sophisticated. If your objective is to make a generic library that makes porting awk to Python simpler, then take into account doing so. But writing a loop that will get you precisely what you want for a particular case is often simpler to do and simpler to know (and thus preserve).

Understanding awk fields

Once you have got a string that corresponds to a line, in case you are changing an awk program, you typically wish to break it up into fields. Python has a number of methods of doing that. This will return an inventory of strings, splitting the road on any variety of consecutive whitespaces:

line.break up()

If one other subject separator is required, one thing like this will break up the road by :; the rstrip technique is required to take away the final newline:

line.rstrip("n").break up(":")

After doing the next, the listing components can have the broken-up string:

components = line.rstrip("n").break up(":")

This break up is sweet for selecting what to do with the parameters, however we’re in an off-by-one error state of affairs. Now components[0] will correspond to awk’s $1, components[1] will correspond to awk’s $2, and many others. This off-by-one is as a result of awk begins counting the “fields” from 1, whereas Python counts from zero. In awk’s $zero is the entire line — equal to line.rstrip(“n”) and awk’s NF (variety of fields) is extra simply retrieved as len(components).

Porting awk fields in Python

As an instance, let’s convert the one-liner from “How to remove duplicate lines from files with awk” to Python.

The authentic in awk is:

awk '!visited[$0]++' your_file > deduplicated_file

An “authentic” Python conversion can be:

import collections
import sys

visited = collections.defaultdict(int)
for line in open("your_file"):
    did_visit = visited[line]
    visited[line] += 1
    if not did_visit:
        sys.stdout.write(line)

However, Python has extra knowledge constructions than awk. Instead of counting visits (which we don’t use, besides to know whether or not we noticed a line), why not report the visited strains?

import sys

visited = set()
for line in open("your_file"):
    if line in visited:
        proceed
    visited.add(line)
    sys.stdout.write(line)

Making Pythonic awk code

The Python group advocates for writing Pythonic code, which implies it follows a generally agreed-upon code model. An much more Pythonic method will separate the considerations of uniqueness and enter/output. This change would make it simpler to unit check your code:

def unique_generator(issues):
    visited = set()
    for factor in issues:
        if factor in visited:
            proceed
        visited.add(issues)
        yield factor

import sys
   
for line in unique_generator(open("your_file")):
    sys.stdout.write(line)

Putting all logic away from the enter/output code results in higher separation of considerations and extra usability and testability of code.

Conclusion: Python could be a good selection 

Porting an awk script to Python is commonly extra a matter of reimplementing the core necessities whereas enthusiastic about correct Pythonic code model than a slavish transliteration of situation/motion by situation/motion. Take the unique context into consideration and produce a top quality Python answer. While there are occasions when a Bash one-liner with awk can get the job finished, Python coding is a path towards extra simply maintainable code.

Also, in the event you’re writing awk scripts, I’m assured you’ll be able to study Python as effectively! Let me know when you’ve got any questions within the feedback.

Most Popular

To Top