Skip to content

Instantly share code, notes, and snippets.

@lexnederbragt
Last active December 16, 2015 00:09
Show Gist options
  • Select an option

  • Save lexnederbragt/5344976 to your computer and use it in GitHub Desktop.

Select an option

Save lexnederbragt/5344976 to your computer and use it in GitHub Desktop.
An ipython notebook describing how to import a so-called AGP file and build a sqlite database for it with python. The notebook can be viewed here: http://nbviewer.ipython.org/5344976
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "agp_to_sqlite"
},
"nbformat": 2,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"source": [
"This little project was intended to help me learn new stuff about both python and sqlite. ",
"In addition, I may use this code for a larger project to extract information from finished *de novo* genome assembly files. ",
"",
"Source: https://gist.github.com/lexnederbragt/5344976",
"",
"The goal is to import a so-called AGP file and build a sqlite database for it with python. ",
"AGP is a format which descibres the layout of contigs and scaffolds in a *de novo* genome assembly:"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"scaffold00001 1 8192 1 W contig00001 1 8192 +",
"scaffold00001 8193 8622 2 N 430 fragment yes",
"scaffold00001 8623 16524 3 W contig00002 1 7902 +",
"scaffold00001 16525 17553 4 N 1029 fragment yes",
"scaffold00001 17554 51672 5 W contig00003 1 34119 +"
],
"language": "python",
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The file basically states that for 'scaffold00001', from position 1 to 8192 there is contig00001.",
"The next line describes a gap (nothing known except an estimate of the length of the inbetween part) of length 430 bp. ",
"And so on... ",
"",
"To get started and help me out along the way, I used google: ",
"- I googled \"SQLite Python\" and ended up using this site http://zetcode.com/db/sqlitepythontutorial/ ",
"- I googled \"import table from file in sqlite python\" and ended up using this site:",
"http://stackoverflow.com/questions/2887878/importing-a-csv-file-into-a-sqlite3-database-table-using-python",
"",
"Finally, if you want to know more about the AGP file format, check out http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification_v1.1.shtml ",
"or my blog http://contig.wordpress.com/2010/03/22/newbler-output-ii-contigs-and-scaffolds-sequence-files-and-the-454scaffolds-txt-file/",
"****"
]
},
{
"cell_type": "markdown",
"source": [
"Set up the environment"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"import sqlite3"
],
"language": "python",
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"path = 'some/path'"
],
"language": "python",
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"source": [
"A function for loading the agp file. For newbler assemblies these files usually are called '454scaffolds.txt'"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"def load_agp(path):",
" data = []",
" with open (path + '/454Scaffolds.txt') as fh:",
" for line in fh.readlines():",
" i = line.split()",
" # The 'gap' lines do not have a ninth column",
" # for these, add and empty column",
" if len(i) == 8:",
" i.append('')",
" data.append(i)",
" return data"
],
"language": "python",
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"source": [
"Load data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data = load_agp(path)",
"# look at some lines",
"data[0:4]"
],
"language": "python",
"outputs": [
{
"output_type": "pyout",
"prompt_number": 4,
"text": [
"[['scaffold00001', '1', '8192', '1', 'W', 'contig00001', '1', '8192', '+'],",
" ['scaffold00001', '8193', '8622', '2', 'N', '430', 'fragment', 'yes', ''],",
" ['scaffold00001', '8623', '16524', '3', 'W', 'contig00002', '1', '7902', '+'],",
" ['scaffold00001', '16525', '17553', '4', 'N', '1029', 'fragment', 'yes', '']]"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"source": [
"Generate an sqlite table"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# create table",
"# using :memory: puts the table in RAM instead of in a file on disk",
"con = sqlite3.connect(\":memory:\")",
"cursor = con.cursor()",
"# add column names (header)",
"columns = ('object', 'object_beg', 'object_end', ",
" 'part_number', 'component_type',",
" 'c_id_gap_len', 'c_beg_g_type', ",
" 'c_end_link', 'orientation',)",
" ",
"# create a table",
"cursor.execute(\"CREATE TABLE agp \"+ str(columns) + \";\")",
"",
"# alternative command:",
"#cursor.execute(\"\"\"CREATE TABLE agp",
"# (object, object_beg, object_end, ",
"# part_number, component_type,",
"# c_id_gap_len, c_beg_g_type, ",
"# c_end_link, orientation ) ",
"# \"\"\")"
],
"language": "python",
"outputs": [
{
"output_type": "pyout",
"prompt_number": 5,
"text": [
"<sqlite3.Cursor at 0x10451dc00>"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"source": [
"Add the data to the table. Format for the command: ",
"> cursor.executemany(\"INSERT INTO table_name (col1, col2, ...) VALUES (?, ?, ...);\", data) ",
"",
"where 'col1', 'col2', are the names of the columns and one '?' is needed for each column ",
"Here we use the tuple 'columns' for the column names and a 'join' construct for the nine required question marks :"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"cursor.executemany(\"INSERT INTO agp \"+ str(columns) + \" VALUES (\" + ','.join(('?')*9) + \");\", data)",
"con.commit()"
],
"language": "python",
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "markdown",
"source": [
"To look at the first four lines:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"with con:",
" cursor.execute(\"SELECT * FROM agp\")",
" rows = cursor.fetchall()",
" for row in rows[0:4]:",
" print row"
],
"language": "python",
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(u'scaffold00001', u'1', u'8192', u'1', u'W', u'contig00001', u'1', u'8192', u'+')",
"(u'scaffold00001', u'8193', u'8622', u'2', u'N', u'430', u'fragment', u'yes', u'')",
"(u'scaffold00001', u'8623', u'16524', u'3', u'W', u'contig00002', u'1', u'7902', u'+')",
"(u'scaffold00001', u'16525', u'17553', u'4', u'N', u'1029', u'fragment', u'yes', u'')"
]
}
],
"prompt_number": 23
},
{
"cell_type": "markdown",
"source": [
"Now we can extract some data. As an example, we'll extract the gaps for scaffold00002 so we can calculate the total gap size:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"with con:",
" cursor.execute(\"\"\"SELECT c_id_gap_len FROM agp ",
" WHERE object = 'scaffold00002' ",
" AND component_type = 'N'\"\"\")",
" rows = cursor.fetchall()",
" total_gap_bases = sum([int(i[0]) for i in rows])",
" print \"%s bp in gaps\" % (total_gap_bases)"
],
"language": "python",
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"47817 bp in gaps"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"source": [
"Let's define some functions"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"def total_gap_bases(object_id):",
" type = 'N' #indicating lines corresponding to gaps",
" with con:",
" cursor.execute(\"\"\"SELECT c_id_gap_len FROM agp ",
" WHERE object = ? ",
" AND component_type = ?\"\"\", (object_id, type))",
" rows = cursor.fetchall()",
" total_gap_bases = sum([int(i[0]) for i in rows])",
" return int(total_gap_bases)"
],
"language": "python",
"outputs": [],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"def number_of_contigs(object_id):",
" type = 'W' #indicating lines corresponding to contigs",
" with con:",
" cursor.execute(\"\"\"SELECT c_id_gap_len FROM agp ",
" WHERE object = ? ",
" AND component_type = ?\"\"\", (object_id, type))",
" rows = cursor.fetchall()",
" return len(rows)"
],
"language": "python",
"outputs": [],
"prompt_number": 19
},
{
"cell_type": "markdown",
"source": [
"We use them for extracting the number of contigs and total gap length for a specific scaffold"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"scaffold = 'scaffold00002'",
"gap = total_gap_bases(scaffold)",
"num_ctg = number_of_contigs(scaffold)",
"print \"%s has %s contigs with in total %s bp in gaps\" % (scaffold, num_ctg, gap)"
],
"language": "python",
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"scaffold00002 has 23 contigs with in total 47817 bp in gaps"
]
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"source": [
"Or we can do this for all scaffolds. First, we need a list of all scaffold IDs:"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"def get_scaffold_IDs():",
" with con:",
" cursor.execute(\"SELECT object FROM agp\")",
" rows = cursor.fetchall()",
" # convert tuple of tuples to strings",
" rows = [i[0] for i in rows]",
" return list(set(rows))"
],
"language": "python",
"outputs": [],
"prompt_number": 25
},
{
"cell_type": "markdown",
"source": [
"Now we can generate the results:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"scaffold_IDs = get_scaffold_IDs()",
"print '\\t'.join(['scaffold','contig_count','total_gap_length'])",
"for scf in sorted(scaffold_IDs):",
" print '\\t'.join([scf, ",
" str(number_of_contigs(scf)), ",
" str(total_gap_bases(scf))])"
],
"language": "python",
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"scaffold\tcontig_count\ttotal_gap_length",
"scaffold00001\t176\t209029",
"scaffold00002\t23\t47817",
"scaffold00003\t7\t10215",
"scaffold00004\t1\t0",
"scaffold00005\t2\t175",
"scaffold00006\t1\t0",
"scaffold00007\t1\t0",
"scaffold00008\t1\t0",
"scaffold00009\t1\t0",
"scaffold00010\t1\t0",
"scaffold00011\t1\t0"
]
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"source": [
"This last example is not very optimal for large datasets, as each scaffold results in a sql call.",
"It would probably be better to extract the relevant columns from the database once and parse them afterwards.",
"",
"This was a very educational exercise for me, with great help from google. Enjoy!"
]
}
]
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment