Science is swimming in data, and the already daunting task of managing and analyzing this information will only become more difficult as new scientific instruments come online, especially those capable of delivering more than a petabyte (a quadrillion bytes) of information per day.
Tackling these extreme data challenges will require a system that is easy enough for any scientist to use, that can effectively harness the power of ever-more-powerful supercomputers, and that is unified and extensible. This is where SciDB, as implemented at the Department of Energy's (DOE) National Energy Research Scientific Computing Center (NERSC), comes in.
At Milpitas-based flash memory storage and software company SanDisk Corp., Nithya Ruff, director of the company's open source strategy, is a driving force behind science, technology, engineering and math (STEM) initiatives to get more girls interested in the field. Ruff grew up in Bangalore, India, and learned to code at North Dakota State University, where she earned her master's degree in computer science.
But inside the spacecraft's Linux-based flight software, a problem was brewing. Every 15 seconds, LightSail transmits a telemetry beacon packet, and the software controlling the main system board writes the corresponding information to a file called beacon.csv. If you're not familiar with comma-separated values (CSV) files, you can think of them as simplified spreadsheets; in fact, most can be opened with Microsoft Excel.
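To make that concrete, here is a rough Python sketch of what appending a beacon record to such a CSV file could look like. The field names, values, and file path are hypothetical illustrations, not taken from the actual LightSail flight code.

```python
import csv
import time

BEACON_LOG = "beacon.csv"  # illustrative path, not the real flight software layout

def log_beacon(packet):
    """Append one telemetry beacon record to the CSV log.

    `packet` is a dict of telemetry fields; the keys below are
    illustrative examples, not the spacecraft's real schema.
    """
    fields = ["timestamp", "battery_voltage", "cpu_temp_c", "uptime_s"]
    with open(BEACON_LOG, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        if f.tell() == 0:  # brand-new file: write the header row once
            writer.writeheader()
        writer.writerow({k: packet.get(k, "") for k in fields})

# One record roughly every 15 seconds, matching the beacon cadence.
log_beacon({
    "timestamp": int(time.time()),
    "battery_voltage": 7.9,
    "cpu_temp_c": 34.2,
    "uptime_s": 86400,
})
```

Because every beacon appends another row, the file grows steadily for as long as the software keeps running.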
And Julia is a big deal: it's a free alternative to proprietary data science tools such as MathWorks' MATLAB and Wolfram's Mathematica, and a more modern language than the open-source mainstays R and Python. As more companies hire data scientists to make data-driven decisions, open-source tools like these come in handy.
In October 2014, Databricks participated in the Sort Benchmark and set a new world record for sorting 100 terabytes (TB) of data, or 1 trillion 100-byte records. The team used Apache Spark on 207 EC2 virtual machines and completed the sort in 23 minutes. In comparison, the previous world record, set by Hadoop MapReduce, used 2,100 machines in a private data center and took 72 minutes; in other words, Spark sorted the same data roughly three times faster with about one-tenth the machines. The Databricks entry tied with one from a UCSD research team that builds high-performance systems, and the two jointly set the new world record.
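The record run relied on heavy tuning and purpose-built record readers, but the core operation is a distributed sort of key-value records, which maps directly onto Spark's standard API. Here is a minimal, illustrative PySpark sketch; the input and output paths are placeholders, and splitting each 100-byte record into a 10-byte key and 90-byte payload follows the benchmark's GraySort convention rather than anything in the Databricks code.

```python
from pyspark import SparkConf, SparkContext

# Illustrative sketch only: the paths and settings below are placeholders,
# not the configuration Databricks used for the record run.
conf = SparkConf().setAppName("gray-sort-sketch")
sc = SparkContext(conf=conf)

# Treat the first 10 bytes of each 100-byte line as the sort key
# (the GraySort convention) and the remaining bytes as the payload.
records = sc.textFile("hdfs:///data/unsorted") \
            .map(lambda line: (line[:10], line[10:]))

# sortByKey shuffles records across the cluster into a global order;
# the sorted output is written back out as one file per partition.
records.sortByKey() \
       .map(lambda kv: kv[0] + kv[1]) \
       .saveAsTextFile("hdfs:///data/sorted")
```

The heavy lifting happens in the shuffle triggered by sortByKey, which is exactly the stage the benchmark stresses: moving and ordering 100 TB of records across every machine in the cluster.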