Kevin Jardine's blog

Making a galaxy map, Part 4: Enclosure

Finding structure in Gaia DR2 is like mapping foggy terrain. We may not know exactly how high the peaks go but we can find the mountain ranges nonetheless.

As I mentioned in my last blog post, because of strong dust extinction and reddening in some regions within 3000 parsecs as well as data selection effects and perhaps instrument magnitude limits, two regions that in reality may have similar hot star density may have very different density values in Gaia DR2. In this blog post I describe a technique using structural enclosure analysis for mapping star concentrations using density isosurfaces in the absence of precise density values.

To explain the enclosure technique I'll first step back a dimension and look at how we can analyze structure in contour lines on a topographic elevation map.

Consider the topographic map on the left side of this image:

If we are only interested in the structure of the region, we can remove the contour lines within individual peaks to get the simplified map on the right side.

We can represent the structure of this map using a tree diagram:

which just tells us that there are four peaks, one of which has two subpeaks. The tree diagram removes all the shape, size and location data but reveals the basic map structure.

It turns out that trees can represent any kind of enclosure, regardless of dimension. So they can be used to represent the enclosure structure of hot star density isosurfaces as well.

Another way to represent a tree is through enclosed circles. This circle diagram represents exactly the same tree. This diagram also has no location, shape or size data, but can be a clearer way to represent a complex tree:

The wonderful D3.js JavaScript library has a system to display any tree as a set of enclosed circles, so on this site I've put a circle enclosure diagram for the entire set of connected hot star isosurface regions with a density of 5% or more and with each region containing 5 or more stars (click to view the entire diagram in a separate window):

The background colour represents the star density and the circle size represents (very roughly) the number of stars it contains. Hovering your mouse over a region (when displayed on its separate page) gives the name of the region and a star count. The name of the region consists of the two numbers separated by a dash. The first number gives the density mesh the region is found in and the second number gives a sequential number for the connected regions found in the mesh.

This enclosure diagram is a detailed guide to the structures contained inside the 3D print I showed at the top of my previous blog post.

It turns out that the isosurface regions have some unusual enclosure properties that we can exploit to create our map.

Before I describe those, I think it would be helpful to view some annotated details of the above image:

On the left side:

On the right side:

We can immediately see density problems in these images. For example, the left image implies that there are more hot stars in the nearby southern stars surrounding the Sco OB2 and Vel OB2 associations in the solar neighbourhood than in the Carina region, one of the largest star formation regions known in the Milky Way. Clearly this is not true - it is an artifact of the star selection process, most likely caused by dust reddening, the restriction to very low error parallaxes, or perhaps instrument magnitude limits.

For this reason, a star map based purely on Gaia DR2 star density values would be highly misleading. Fortunately, a careful examination of the enclosure diagram suggests an alternative approach.

We can see that the density distribution is found in three distinct kinds of regions. First, except for a dense concentration in the solar neighbourhood, all the isosurface meshes contain one large region (which I call a plateau) and numerous smaller denser regions (which I call ranges after mountain ranges). The remaining regions occur as density peaks within the ranges.

The plateaus, being large enclosing structures, do not provide the kind of detailed site specific information that the ranges and peaks do, so I excluded them from the map. Looking at the list of ranges greatly simplified the mapping process - although there are about 33 thousand connected regions with a density of 5% or more, there are only about 6 thousand ranges. So this was more than an 80% reduction in the regions that needed to be considered for the map. Most of the connected regions turned out to be peaks contained within these 6 thousand ranges.

6 thousand ranges would still create too much clutter for a map. Without being able to use density, I needed an alternative criteria to select the most important ranges.

I decided to select ranges based on the number of ionizing stars they contained. Well, not exactly - what I actually did was to count the number of stars with a Gaia DR2 colour index less than -0.2. According to Eric Mamajek's very useful stellar colour table, the O-B3 ionizing stars have colour indices lower than this limit in the standard B-V colour system.

It actually seems unlikely to me that the Gaia DR2 colours are precise enough for a specific limit to select ionizing stars, but as you will see in the next blog post, I used an independent list of known ionizing stars as a sanity check for the map.

After some experiments, I split the ranges into three groups: the very ionizing ranges that contained 10 or more stars with ionizing colours, the some ionizing ranges that contained 4 to 9 stars with ionizing colours, and the rest with less than 4 stars with ionizing colours.

I then split the ranges with few or no stars with ionizing colours into three groups based on the total number of hot stars they contained: those with 50 or more, those with 25 to 49 stars, and those with less than 25 stars.

As the last group of ranges with less than 25 stars consisted of many small star clumps that tended to clutter up the map, I did not use those, much as a map of a continent does not attempt to show every town or village.

I also left the set of non-ionizing ranges with 25 to 49 hot stars off the face-on map poster, but it is available as a separate download for Gaia Sky.

For each set of ranges, I also created a set of peaks. After some experimenting, I selected the peaks inside each range with 50 or fewer hot stars. This made it possible to render translucent ranges containing several smaller denser peaks.

I will provide links to the four mesh sets (each containing one range file and one peak file) next week as part of the last blog post in this series.

Read more about Making a galaxy map, Part 4: Enclosure

Making a galaxy map, Part 3: Meshing

A 3D print of the largest connected region in the 5% mesh.

A mesher takes scalar field slices (such as described in my previous blog post) and a scalar value v and outputs an isosurface mesh such that the surface has value v everywhere, typically using the Marching Cubes algorithm. In this case, these are isosurfaces of constant density.

I used the SliceCubes() function in the Python version of the VTK library to convert the slices to meshes using the VTK implementation of the Marching Cubes algorithm.

I computed a mesh for each density value from 0.99 (99%) to .05 (5%) in 0.01 (1 percent) increments, or 95 meshes in all. The lowest density value of 0.05 (5%) corresponds to 3.33 x 10**(-4) stars per 3x3x3 bin.

I kept a list of the stars contained in each connected region (for future data mining) and output each connected region containing 5 or more stars as a separate mesh.

A typical density mesh may contain hundreds of connected regions. For all the density levels combined, there were 33239 connected regions.

The largest connected region in the 5% density mesh contains 297498 stars, which is 97% of all the stars in the 5% density mesh and 88% of all the stars in the cylinder being mapped. I made a 3D print of this region (displayed in the picture at the top of this blog post).

All of the regions in the galaxy map are inside the 5% density mesh and hence essentially all are inside the 3D print shown above.

But how to choose which of the more than 33 thousand regions to include in the map?

The regions are nested in the sense that all the regions with a density value v1 are contained in regions with density value v2 if v1 > v2.

A medical render with two translucent density levels.

Typically, for example, in MRI scans, this density difference (for example between bone and other tissue) is used to render a meaningful display by selecting all the regions from a small range of density values (perhaps 2 to 4) and displaying them using nested translucent meshes.

Indeed I used a similar technique to display TGAS regions from Gaia DR1.

But now we reach the fundamental problem with mapping star distribution using density isosurfaces in Gaia DR2. With medical applications such as MRI scans or in regions of low extinction such as the solar neighbourhood, the density values are well defined. This is not the case for Gaia DR2. Because of dust reddening, instrument magnitude limits and other selection effects, two regions that in reality may have similar star density may have very different density values in Gaia DR2.

How do we map star concentrations using density isosurfaces when the real density values are uncertain?

In my next blog post I will describe how I solved this problem and present what I believe is a new technique for mapping density isosurface regions in the presence of density uncertainty.

Read more about Making a galaxy map, Part 3: Meshing

Making a galaxy map, Part 2: Slicing

Hot star density slice near the galactic plane

An example hot star density slice (in this case near the galactic plane).

In order to apply the isosurface mapping technique, we must convert a point cloud, such as the set of about 340 thousand hot O, B and early A class stars I described in my last blog post, into a scalar field, a real valued function defined in principle for every point in euclidean space (or at least in a defined grid). This scalar field is in some way a measure of star density. We can then extract isosurfaces of constant density from this scalar field, much as contour lines are lines of constant elevation on a topographic map of mountainous terrain on Earth.

In order to make the necessary calculations manageable on my 16 Gb workstation, I placed the density values into 3x3x3 parsec bins.

I represented the cylinder mentioned in my last blog post (with a radius of 3000 parsecs and height 600 parsecs above and below the galactic plane) as a subset of a 6000 x 6000 x 1200 pc box and reduced this into 2000 x 2000 x 400 bins (with dimensions 3x3x3 pc) in a numpy array.

I then counted the number of stars in each bin. This created a scalar field of sorts, but a very discontinuous one.

I borrowed the gaussian smoothing technique mentioned in H. Bouy and J. Alves 2015, "Cosmography of OB stars in the Solar neighbourhood". For the map I chose a a gaussian sigma of 15 parsecs. The choice of a gaussian sigma is as much an aesthetic choice as it is a scientific one. The higher the sigma, the less detailed the map. A low sigma results in a chaotic map. A high sigma results in a smooth map with limited detail. I experimented with several values before settling on 15 parsecs.

Fortunately, the Python library scipy.ndimage.filters has an appropriate gaussian_filter function that could be applied directly to the numpy array.

In addition, for ease in future processing, I wanted to ensure that the values of the density function ran between 0 and 1.

There are numerous ways to normalize the density values in this way (for example dividing by the maximum value). The option I chose is to use this variation of the sigmoidal logistic function:

f(x) = 2/(1+exp(-kx)) - 1

which ranges between 0 and 1 in an S-shaped curve as x ranges between 0 and infinity.

This sigmoidal value stretching / normalization is often used in image processing to continuously compress data with very high and low values into a smoother middle range. As with the gaussian sigma value, the logistic k value is partly an aesthetic choice. A low k value pushes the function towards 0. A high value pushes the function towards 1. After experimenting with various values for k, I selected 300. This seemed to more evenly spread the possible density values between 0 and 1.

The scipy.special library has an expit function that can be applied to numpy arrays.

I then sliced the cylinder values in the numpy box from top to bottom in 3 pc slices and output the results as coloured 8 bit images (for debugging the sigma and k values) and as 16 bit pgm values (needed for isosurface generation). One of the coloured slice images is at the top of this blog post.

Once the slices were done, I could move on to the next stage: meshing.

Read more about Making a galaxy map, Part 2: Slicing

Making a galaxy map, Part 1: Data

The public archive site can be used to download Gaia DR2 data.

Last week I released a map of the Milky Way within 3000 parsecs (10 thousand light years) of the Sun made using data from Gaia DR2 and several other sources as described on my previous blog post.

You can see a face-on web version here or view the full 3D meshes in the latest version of Gaia Sky.

This week I'm starting a series of blog posts to explain in detail how I created the map. I hope that these posts will be useful for others who might want to try the density isosurface technique to create their own maps. At the end of this series I'll link to relevant resources, including the map meshes, SVG and Blender files, and Python source code.

The Gaia DR2 archive provides parallaxes for more than a billion stars, but I used only a tiny fraction of these to create the actual map. First, I selected only the stars with the most accurate parallaxes in the Gaia DR2 data set, the 72 million stars with error/parallax < 0.1.

There is a field in the Gaia DR2 database that is the reciprocal of this value, parallax_over_error. So in practice I was selecting the stars with parallax_over_error > 10.

The main difficulty with extracting this data set from the Gaia archive was the time. For the main archive, there is a maximum query limit of 30 minutes and at the time I was using the archive, selecting the data was slow. I ended up splitting the data selection into 28 queries based on the parallax value. A sample query is shown in the image at the top of this blog post. It took most of a day to get all the data. I do wonder if the Gaia database administrators might consider allowing faster downloads for commonly used data subsets.

After acquiring the data, I applied a few quality cuts and selected only about 400 thousand very hot O, B and early A class stars using a colour index filter with colour index < 0 and added an absolute magnitude cut (absolute magnitude < 7) to filter out dim but hot non-main sequence stars like white dwarfs. (Actually the hot stars I was mapping all have an absolute magnitude much brighter than 7. I chose that number to compensate for some dust extinction as well.)

The Gaia DR2 database includes a raw colour index value bp_rp, and a second colour excess value, e_bp_min_rp_val, that attempts to correct for dust reddening.

So we can define colour index = bp_rp - e_bp_min_rp_val.

These values are a relatively simple estimate of what Gaia will ultimately be capable of - more accurate colour values will be provided in Gaia DR3.

Filtering to these hot, young stars was an essential part of making the map possible. Cooler, older stars tend to drift randomly from their sites of formation. I tried a number of experiments to map cooler stars without a great deal of success - they were either present in a large number of random clumps or, at lower density levels, just a few large areas.

In contrast, hot young stars are largely gathered in dense associations, mostly near the galactic plane. Their distribution, especially those regions containing the ultra hot O-B3 class ionizing stars, form a distinct (and hence mappable) pattern.

After the colour selection, I further filtered to all stars within a cylinder with a radius of 3000 parsecs and a height of 600 parsecs above and below the galactic plane.

Why the 3 kpc limit? The main reason is that stars with very accurate parallaxes (err/plx < 0.1) start to fade out beyond 3.6 kpc. I definitely wanted to include the Carina nebula, which is more than 2 kpc away. The further I extended the map, the less accurate the star positions became, and also, more pragmatically: most of the structures (eg. OB associations or HII regions) that astronomers have named turn out to lie within 3 kpc. So a 3 kpc limit allowed the most accurate map with names for most of the major regions.

The 600 parsec height limit was a bit arbitrary and was mostly to reduce memory consumption. This did result in some truncation for a few very low density regions. An interesting alternative map might ignore the thin disk and look for density structures in the thick disk and halo.

This distance cut gave me a final data set of about 340 thousand stars, ready for the next stage of density isosurface generation - slicing.

Read more about Making a galaxy map, Part 1: Data

A map of the Milky Way out to 3000 parsecs (10 thousand light years)

A detail from the face-on map of the Milky Way. The full map poster is is available on the web here.

Sometimes dreams do come true. Today I can announce a detailed map of the Milky Way out to 3000 parsecs or about 10 thousand light years from the Sun.

I developed this map with help from scientists with the European Space Agency's Gaia mission and researchers at Leiden and Heidelberg universities. It includes star density isosurfaces mapping the major concentrations of the hotter O, B and A class stars in the Gaia DR2 release, about 5000 extremely hot ionizing stars, dust clouds and HII regions. Even better, it is available both in a face-on form viewed from above the Milky Way and in a true 3D version in the latest version of Gaia Sky.

A more detailed description of the map is up at the ESA Gaia mission website here.

A printable pdf is available here. (Warning, even at A0 size, the smallest labels are tiny. Needs sharp eyes, a magnifying glass or printing at a custom size with a width greater than one metre.)

A zoomable pannable web version is available here.

There is a gray background in the face-on versions of the map that shows the approximate locations of the inner and outer galaxy.

You can download the latest version of Gaia Sky with 3D mesh support here.

A Youtube trailer showing a flight through the map, animated in Gaia Sky, is available here:

Many people have helped make this map possible. I'd like to mention a few now.

Anthony Brown, a professor at Leiden University and the chair of Gaia's Data Processing and Analysis Consortium (DPAC), has been an enthusiastic supporter of this project and was always available with suggestions and comments.

Toni Sagristà Sellés, the principal developer of Gaia Sky, added numerous features to Gaia Sky to support visualizing and animating the 3D meshes.

Stefan Jordan, a professor at Heidelberg University, the sponsor of Gaia Sky, and Manager for Outreach and Education for DPAC, has also provided support and encouragement.

And at the European Space Agency itself, Jos de Bruijne and Tineke Roegiers, two Gaia Mission scientists, provided advice and support at key points in this project.

Finally, I'd like to thank Bob Benjamin, professor at the University of Wisconsin - Whitewater, who has long encouraged my interest in galactic cartography and made useful suggestions for positioning the HII regions on the current map.

The map depends upon data from many researchers. Here are some key credits:

Gaia Data Release 2 data as available from the Gaia Archive.

The isosurface technique was first pioneered for Hipparcos by H. Bouy and J. Alves, as explained in "Cosmography of OB stars in the Solar neighbourhood".

The map also includes dust density isosurfaces computed by dust extinction values provided in a preprint article "3D maps of interstellar dust in the Local Arm: using Gaia, 2MASS and APOGEE-DR14" by Lallement et al.

Positions for OB associations are based on the computed median distances to their members in the data set as described by R.M. Humphreys and D.B. McElroy in "The initial mass function for massive stars in the Galaxy and the Magellanic Clouds".

Positions for HII regions are computed using the median distances for known ionizing stars as described here.

The map also includes an overlay of about 5000 known ionizing stars. Distances for these stars are derived from a catalogue associated with a preprint article by C.A.L. Bailer-Jones et al in "Estimating distances from parallaxes IV: Distances to 1.33 billion stars in Gaia Data Release 2".

Of course all errors are mine. As with any map in the early days of exploration, it is bound to have missing sections and regions with incorrect data. But I think that it is a good start. I'm excited about the current map and even more looking forward to the many new features and improvements that we'll be adding in the future.

I've been running this site for almost 14 years now but today feels like a new beginning.

Read more about A map of the Milky Way out to 3000 parsecs (10 thousand light years)

You are here