Arc Forum
1 point by akkartik 5433 days ago | link | parent

Ah, here's a technical question: are you using Arc for HTML parsing? I couldn't see how to, so I just built my crawler with Python and BeautifulSoup, writing JSON.


1 point by thaddeus 5433 days ago | link

yep.

http://github.com/nex3/arc/tree/arc2.master/lib/http-get/

From there I just built some custom Arc functions to parse the file.

Here's an example:

  (def file-linefeed->list (path+file)
    (accum tolist
      (w/infile inf path+file
        (whiler line (readline inf) nil
          (tolist line)))))

  (= myfilelist (file-linefeed->list path+file))

Then just step through myfilelist and find what you're looking for.

Don't forget to use aws's readline, or line endings will not load correctly.
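
For comparison with the Python crawler mentioned upthread, here is a rough Python counterpart to file-linefeed->list. It is only a sketch: the name file_lines and the UTF-8 encoding are assumptions, not anything from the Arc code above.

```python
def file_lines(path):
    """Return the lines of the file at `path` as a list of strings,
    with the line terminators removed."""
    # newline="" turns off Python's newline translation, so the raw
    # terminators reach splitlines(), which treats \n, \r\n and \r
    # (among a few other separators) as line ends.
    # The UTF-8 encoding is an assumption about the downloaded file.
    with open(path, encoding="utf-8", newline="") as f:
        return f.read().splitlines()
```

Calling file_lines on a saved page gives one list entry per line, whichever line-ending style the server used.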

-----

1 point by akkartik 5433 days ago | link

So it seems you don't need full-blown HTML parsing for your scraper.

-----

1 point by thaddeus 5433 days ago | link

Correct.

I just find the subset of lines I need by locating start and end marker lines, then write a custom parser for that section. I might be wrong, but for my needs a full-blown HTML parser would be much slower, and I'm hitting the same file structure every time (for each stock).
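
The start/end marker idea could be sketched roughly like this in Python. The function name lines_between, the substring matching, and the sample markers are all hypothetical, chosen only for illustration; the actual markers depend on the page being scraped.

```python
def lines_between(lines, start_marker, end_marker):
    """Return the lines strictly between the first line containing
    start_marker and the next line containing end_marker.

    Returns an empty list if start_marker never appears. If end_marker
    never appears, everything after the start line is returned.
    """
    section = []
    inside = False
    for line in lines:
        if not inside:
            # Still looking for the start marker; skip this line.
            if start_marker in line:
                inside = True
            continue
        if end_marker in line:
            break
        section.append(line)
    return section


# Made-up sample input, not taken from any real page.
sample = [
    "<html>",
    "<table id='quotes'>",
    "<td>ABC</td>",
    "<td>1.00</td>",
    "</table>",
    "<td>ignored</td>",
]
quote_rows = lines_between(sample, "<table id='quotes'>", "</table>")
```

The custom parser then only has to deal with quote_rows, which is why this stays cheap when the page layout is the same on every fetch.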

-----

2 points by akkartik 5433 days ago | link

Yes, that definitely seems reasonable.

Arc is missing an HTML parser; I may take care of it. It doesn't have to be built in Arc, just callable from within Arc.

-----

1 point by thaddeus 5433 days ago | link

I vaguely remember trying this out a long time ago. It may help:

http://github.com/nex3/arc/blob/arc2.master/lib/xml.arc

-----