A Fake Shell for Pangenomics

(cs.cornell.edu)

24 points | by matt_d 4 days ago

2 comments

epistasis 20 minutes ago
> It might seem odd to prefer shell scripting over a full-featured dynamic scripting language, but shell scripts like this have some material advantages over Python:
And thus 99% of bioinformatics pipelines are shell at their heart... You need 10 packages, written in 4 different programming languages, and the common interfaces are files and pipes.
And for that matter, this could use a named pipe rather than a file (assuming `odgi depth` only uses streaming access):
```
    odgi depth -i chr8.pan.og -r chm13#chr8 | \
        bedtools makewindows -b /dev/stdin -w 5000 > chm13.chr8.w5kbps.bed
    
    odgi depth -i chr8.pan.og -b chm13.chr8.w5kbps.bed --threads 2 | \
        bedtools sort > chr8.pan.depth.w5kbps.bed
```
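For comparison, an explicitly named pipe could look roughly like the sketch below. It is only an illustration: `seq` and `awk` stand in for the `odgi depth | bedtools makewindows` stage and `sort` stands in for the second `odgi depth -b` call, so it runs without either tool installed. The function name `windows_via_fifo` and the FIFO path are made up, and the approach only works if the consumer reads its input once, front to back.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the two-stage pipeline: a producer writes
# windows into a FIFO while a consumer reads them as if from a file.
windows_via_fifo() {
    local tmpdir fifo
    tmpdir=$(mktemp -d)
    fifo="$tmpdir/windows.bed"
    mkfifo "$fifo"

    # Producer (stand-in for `odgi depth ... | bedtools makewindows ...`).
    # Runs in the background; opening the FIFO blocks until a reader appears.
    seq 0 5000 20000 | awk '{ printf "chr8\t%d\t%d\n", $1, $1 + 5000 }' > "$fifo" &

    # Consumer (stand-in for `odgi depth -b FILE`): reads the FIFO by path,
    # here just sorting windows by start coordinate, largest first.
    sort -k2,2nr "$fifo"

    # Reap the background producer, then clean up the FIFO.
    wait
    rm -rf "$tmpdir"
}

windows_via_fifo
```

The producer has to run in the background because both ends of a FIFO must be open at the same time, which is the bookkeeping that an anonymous pipe or process substitution handles for you.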
And Bash process substitution allows writing it all without an explicitly named pipe, though it may look a bit ugly:
```
    odgi depth -i chr8.pan.og --threads 2 -b <( \
            odgi depth -i chr8.pan.og -r chm13#chr8 | \
            bedtools makewindows -b /dev/stdin -w 5000
        ) | \
        bedtools sort > chr8.pan.depth.w5kbps.bed
```
Which is why bioinformaticians get bad reputations with software engineers. (I still have a fair amount of misplaced pride for adding a shebang to a Makefile once to make a pipeline into a command several decades ago...)

gianiac 1 hour ago
I really like the IR-based approach. It solves something that's always bothered me about shell pipelines: you're forced to think in terms of serializing bytes, even when both ends of the pipe are the same program and could just share memory. Flash makes that optimization explicit and easy to compose with the rest of the pipeline. One question, though: have you run into any issues with the "opportunistic" binary format substitution (the .flatgfa fallback) when scripts are shared across machines where some files have already been converted and others haven't?