UPDATE August 7th, 2017: All Hacker News submissions are now available on BigQuery, and the dataset is updated daily. If you are scraping Hacker News data at scale, it may be more efficient to use BigQuery instead.

An example query to get the top 2,000 Hacker News submissions:

#standardSQL
SELECT title, score
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
ORDER BY score DESC
LIMIT 2000

The web interface can only download up to 10,000 titles; you'll need to use an API to get more.

This repository contains simple Python scripts to download all Hacker News submissions and comments and store them in a PostgreSQL database, for use in ad-hoc data analysis. These scripts are optimized from the scripts used to gather data for my October 2014 blog post The Quality, Popularity, and Negativity of 5.6 Million Hacker News Comments. Parameters for connecting to the appropriate PostgreSQL database are set at the beginning of each file.

This script uses the older Algolia API for Hacker News (instead of the official HN API) due to its support for bulk requests and comment scores for most comments. Run-time of downloading and processing all Hacker News submissions is about 2 hours; run-time of downloading and processing all Hacker News comments is about 11 hours.

Example Queries

Average point score for HN submissions, by hour (EST) of submission:

SELECT EXTRACT(hour from created_at) AS hour, AVG(num_points) AS avg_points
FROM hn_submissions
WHERE num_points IS NOT NULL
GROUP BY hour

hour	avg_points
0	9.718
1	9.063
2	8.521
3	8.929
4	9.113
5	9.492
6	10.099
7	10.965
8	11.513
9	11.692
10	11.141
11	10.832
12	11.187
13	11.716
14	11.237
15	11.178
16	10.735
17	10.731
18	10.709
19	10.935
20	10.942
21	10.836
22	10.386
23	10.090

Number of users who have made atleast n comments, and the average point score for the nth comment a user makes:

SELECT nth_comment, COUNT(num_points) AS users_who_made_num_comments, AVG(num_points) AS avg_points
FROM (
	SELECT num_points,
	ROW_NUMBER() OVER (PARTITION BY author ORDER BY created_at ASC) AS nth_comment
	FROM hn_comments
	WHERE num_points IS NOT NULL
) AS foo
WHERE nth_comment <= 25
GROUP BY nth_comment
ORDER BY nth_comment

nth_comment	users_who_made_num_comments	avg_points
1	159410	2.432
2	99599	2.474
3	79467	2.550
4	68525	2.620
5	60921	2.648
6	55477	2.681
7	51091	2.685
8	47522	2.764
9	44498	2.795
10	41998	2.827
11	39931	2.869
12	37992	2.862
13	36282	2.820
14	34770	2.886
15	33403	2.937
16	32195	2.916
17	31073	2.903
18	30070	2.978
19	29126	2.950
20	28217	2.968
21	27372	2.950
22	26619	2.975
23	25949	3.044
24	25295	3.017
25	24651	3.040

Create the Hacker News leaderboard of users with the most karma, the hard way. (note that aggregated karma values will differ from true values due to vote obfuscation, among other things):

SELECT author, SUM(num_points) - COUNT(num_points) AS karma
FROM (
	SELECT author, num_points
	FROM hn_submissions
	UNION ALL
	SELECT author, num_points
	FROM hn_comments
) AS foo
WHERE num_points IS NOT NULL
GROUP BY author
ORDER BY total_points DESC
LIMIT 25

author	karma
tptacek	136777
pg	87380
ColinWright	76866
danso	57238
llambda	57105
fogus	55146
shawndumas	53092
patio11	51715
tokenadult	47853
ssclafani	46492
jgrahamc	45194
jacquesm	44717
cwan	44665
rayiner	41712
edw519	39716
DanielRibeiro	38530
luu	38035
ChuckMcM	37545
Libertatea	35177
evo_9	34585
lelf	34116
wglb	30763
aaronbrethorst	30220
raganwald	29993
anigbrowl	29875

Known Data Fidelity Caveats

Unfortunately, there are a few issues with the source data, which the scripts attempt to mitigate:

Hacker News automatically converts certain punctuation in Submissions/Comments contain into stylistic unicode (e.g. "smart quotes") which cannot be stored in the database; the scripts will convert the punctuation back to UTF-8.
Comments contain style and link HTML; the scripts attempt to strip it.
On the server-side, there are gaps of missing submission and comment data before 2010.
Comment scores are hidden server-size for comments after October 2014; this is coincidentally the month my blog post was published / the official API was published)