# Browser bookmark files cleanup



## Beeblebrox (Mar 11, 2014)

I have a wild bunch of bookmark backups, in formats ranging from JSON and HTML to SQLite.

I'd like to text-search through all the files and write every unique URL to a text file that a browser (e.g. Firefox) can import. I have NO concern for tags, time, or other metadata; only the URL. Is something like
`$ grep <start text> "http" OR "www" <end text> "whatever" > out.file`
possible? Does grep work through places.sqlite files?

Another alternative, as advised previously in this forum, would be something like: `$ sort -u file1 file2`
The problem here is that sort would not start comparing from "http" or "www", so the result would be a mess. So the solution might be a combination of grep and sort?


----------



## Carpetsmoker (Mar 17, 2014)

> Does grep work through places.sqlite files?



Not directly, obviously, but you can use the sqlite3 command-line tool, e.g.:


```
[/data/code/nordavind]% sqlite3 db/db.sqlite3 'select name from albums order by name limit 5' | uniq
"...Famous Last Words..."
"...in Death of Steve Sylvester"
'Allelujah! Don't Bend! Ascend!
'Tage Mahal
(II)
```

To discover the database schema, one could use:

```
[/data/code/nordavind]% sqlite3 db/db.sqlite3 '.schema' 
CREATE TABLE artists (
                id integer primary key autoincrement,
                name text not null
        );
[...snip...]
```
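For a Firefox `places.sqlite` specifically, the same approach might look like the sketch below. It assumes the `sqlite3` tool is installed and relies on the `moz_places` table, which holds one row per known URL in its `url` column. The function name `dump_places_urls` is made up for illustration. Firefox keeps the database locked while it is running, so working on a copy of the file is probably the safer route.

```shell
# Sketch: print every URL stored in a Firefox places.sqlite file.
# Assumes the sqlite3 CLI is installed; dump_places_urls is a made-up name.
dump_places_urls() {
    # $1: path to a (copy of a) places.sqlite file
    sqlite3 "$1" 'SELECT url FROM moz_places;'
}

# Usage: dump_places_urls places-copy.sqlite >> all-urls.txt
```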



> grep <start text> "http" OR "www" <end text>



You can use an "or" in grep like so:
`grep -E '^(http|www)'`

The `^` anchors the search to the start of the line (you may not want this). Inside a group `(...)` you can use a pipe `|` as an "or" operator. Note that this needs extended regular expressions (`grep -E` or `egrep`).
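In HTML and JSON backups the URLs usually sit in the middle of a line, so printing only the matching part with `-o` may be more useful than printing whole lines. A minimal sketch, where `extract_urls` is a hypothetical helper name and the pattern simply assumes a URL ends at the next quote, angle bracket or whitespace:

```shell
# Sketch: pull every http(s):// or www. address out of text-based backups.
# extract_urls is a hypothetical helper; the pattern is a rough heuristic.
extract_urls() {
    # -E: extended regex, -o: print only the match, -h: omit file names
    grep -Eoh '(https?://|www\.)[^"<>[:space:]]+' "$@"
}

# Usage: extract_urls bookmarks-*.html bookmarks-*.json > all-urls.txt
```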



> sort command would not start differentiating from "http" or "www", so the result would be a mess. Therefore, solution might be a combination of (grep + sort) ?



To make www.site and http://site sort as equal, you could strip these prefixes with `sed` (here `a` is the input file, and the scheme is stripped together with its `://`):
`grep -E '(www|http)' a | sed -E 's#^(https?://)?(www\.)?##' | sort -u`
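If the full URLs should survive so the result can still be imported, one option is to compare on the stripped form but print the original. The sketch below does that with `awk`, then wraps the list in the Netscape-style bookmarks HTML that Firefox's bookmark import should accept, as far as I know. Both function names (`dedupe_urls`, `bookmarks_html`) are invented for this example, and each URL doubles as its own title.

```shell
# Sketch: de-duplicate URLs from stdin, treating http://, https:// and a
# leading www. as equivalent, but printing the first full URL seen.
dedupe_urls() {
    awk '{
        key = $0
        sub(/^https?:\/\//, "", key)   # drop the scheme for comparison
        sub(/^www\./, "", key)         # drop a leading www. as well
        if (!seen[key]++) print        # print only the first URL per key
    }'
}

# Sketch: wrap URLs from stdin in a Netscape-style bookmarks file.
bookmarks_html() {
    printf '%s\n' '<!DOCTYPE NETSCAPE-Bookmark-file-1>' \
        '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">' \
        '<TITLE>Bookmarks</TITLE>' '<H1>Bookmarks</H1>' '<DL><p>'
    # Escape &, < and " first, then turn each line into a bookmark entry
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/"/\&quot;/g' \
        -e 's|.*|    <DT><A HREF="&">&</A>|'
    printf '%s\n' '</DL><p>'
}

# Usage: dedupe_urls < all-urls.txt | bookmarks_html > import-me.html
```

Unlike `sort -u`, this keeps the input order, so no separate sort step is needed unless a sorted list is wanted.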


----------

